Here you will find the current anvi’o programs in the latest stable version of the platform, and their help menus. The contents of this file were last updated on 21 Jun 20 14:19:09, when anvi’o looked like this:
Key | Value |
---|---|
Anvi'o version | esther (v6.2-master) |
Profile DB version | 34 |
Contigs DB version | 18 |
Genes DB version | 6 |
Auxiliary data storage version | 2 |
Pan DB version | 14 |
Genome data storage version | 7 |
Structure DB version | 2 |
KEGG Modules DB version | 2 |
Main anvi’o programs (97) anvi-analyze-synteny, anvi-cluster-contigs, anvi-compute-completeness, anvi-compute-gene-cluster-homogeneity, anvi-compute-genome-similarity, anvi-db-info, anvi-delete-collection, anvi-delete-hmms, anvi-delete-misc-data, anvi-delete-state, anvi-dereplicate-genomes, anvi-display-contigs-stats, anvi-display-metabolism, anvi-display-pan, anvi-display-structure, anvi-estimate-genome-completeness, anvi-estimate-metabolism, anvi-estimate-scg-taxonomy, anvi-experimental-organization, anvi-export-collection, anvi-export-contigs, anvi-export-functions, anvi-export-gene-calls, anvi-export-gene-coverage-and-detection, anvi-export-items-order, anvi-export-locus, anvi-export-misc-data, anvi-export-splits-and-coverages, anvi-export-splits-taxonomy, anvi-export-state, anvi-export-structures, anvi-export-table, anvi-gen-contigs-database, anvi-gen-fixation-index-matrix, anvi-gen-gene-consensus-sequences, anvi-gen-gene-level-stats-databases, anvi-gen-genomes-storage, anvi-gen-network, anvi-gen-phylogenomic-tree, anvi-gen-structure-database, anvi-gen-variability-matrix, anvi-gen-variability-network, anvi-gen-variability-profile, anvi-get-aa-counts, anvi-get-codon-frequencies, anvi-get-enriched-functions-per-pan-group, anvi-get-sequences-for-gene-calls, anvi-get-sequences-for-gene-clusters, anvi-get-sequences-for-hmm-hits, anvi-get-short-reads-from-bam, anvi-get-short-reads-mapping-to-a-gene, anvi-get-split-coverages, anvi-help, anvi-import-collection, anvi-import-functions, anvi-import-items-order, anvi-import-misc-data, anvi-import-state, anvi-import-taxonomy-for-genes, anvi-import-taxonomy-for-layers, anvi-init-bam, anvi-inspect, anvi-interactive, anvi-matrix-to-newick, anvi-mcg-classifier, anvi-merge, anvi-merge-bins, anvi-meta-pan-genome, anvi-migrate, anvi-oligotype-linkmers, anvi-pan-genome, anvi-profile, anvi-push, anvi-refine, anvi-rename-bins, anvi-report-linkmers, anvi-run-hmms, anvi-run-kegg-kofams, anvi-run-ncbi-cogs, anvi-run-pfams, 
anvi-run-scg-taxonomy, anvi-run-workflow, anvi-scan-trnas, anvi-search-functions, anvi-self-test, anvi-setup-kegg-kofams, anvi-setup-ncbi-cogs, anvi-setup-pdb-database, anvi-setup-pfams, anvi-setup-scg-taxonomy, anvi-show-collections-and-bins, anvi-show-misc-data, anvi-split, anvi-summarize, anvi-update-db-description, anvi-update-structure-database, anvi-upgrade.
Ad hoc anvi’o scripts (30) anvi-script-add-default-collection, anvi-script-augustus-output-to-external-gene-calls, anvi-script-calculate-pn-ps-ratio, anvi-script-checkm-tree-to-interactive, anvi-script-compute-ani-for-fasta, anvi-script-estimate-genome-size, anvi-script-filter-fasta-by-blast, anvi-script-gen-CPR-classifier, anvi-script-gen-distribution-of-genes-in-a-bin, anvi-script-gen-help-pages, anvi-script-gen-hmm-hits-matrix-across-genomes, anvi-script-gen-programs-network, anvi-script-gen-programs-vignette, anvi-script-gen-pseudo-paired-reads-from-fastq, anvi-script-gen-scg-domain-classifier, anvi-script-gen-short-reads, anvi-script-gen_stats_for_single_copy_genes.py, anvi-script-get-coverage-from-bam, anvi-script-get-hmm-hits-per-gene-call, anvi-script-get-short-reads-matching-something, anvi-script-merge-collections, anvi-script-predict-CPR-genomes, anvi-script-process-genbank, anvi-script-process-genbank-metadata, anvi-script-reformat-fasta, anvi-script-run-eggnog-mapper, anvi-script-snvs-to-interactive, anvi-script-tabulate, anvi-script-transpose-matrix, anvi-script-variability-to-vcf.
Please let us know if there is something unclear in this output.
Extract ngrams, as in 'co-occurring genes in synteny', from genomes
Usage
anvi-analyze-synteny [-h] -g GENOMES_STORAGE
[--ngram-window-range NGRAM_WINDOW_RANGE]
[-o FILE_PATH] [--annotation-source SOURCE NAME]
[-p PAN_DB] [-n NGRAM_SOURCE] [-l]
[--analyze-unknown-functions] [-G GENOME_NAMES]
Parameters
Essential INPUT:
-g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
Anvi'o genomes storage file
--ngram-window-range NGRAM_WINDOW_RANGE
The range of window sizes of Ngrams to analyze for
synteny patterns. Please format the window-range as x:y
(e.g. Window sizes 2 to 4 would be denoted as: 2:4)
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
Annotation sources for Ngrams: Choose one source of annotations for your Ngrams.
--annotation-source SOURCE NAME
Get functional annotations for a specific annotation
source. You can use the flag '--list-annotation-
sources' to learn about what sources are available.
-p PAN_DB, --pan-db PAN_DB
Anvi'o pan database
-n NGRAM_SOURCE, --ngram-source NGRAM_SOURCE
If two annotation sources are provided, please choose
one annotation source that will be used to calculate
Ngrams (e.g. gene_clusters, COG_FUNCTION)
Optional arguments:
-l, --list-annotation-sources
List available functional annotation sources.
--analyze-unknown-functions
Provide this flag if you want anvi-analyze-synteny to
report Ngrams that contain gene calls that have no
annotation.
-G GENOME_NAMES, --genome-names GENOME_NAMES
Genome names to 'focus'. You can use this parameter to
limit the genomes included in your analysis. You can
provide these names as a comma-separated list of
names, or you can put them in a file, where you have a
single genome name in each line, and provide the file
path.
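To make the `x:y` window-range format concrete, here is a minimal Python sketch of what "Ngrams over a range of window sizes" could look like. It is an illustration only, not anvi'o's implementation: the function names `parse_window_range` and `ngrams`, and the gene labels, are hypothetical, and the sketch ignores strand, contig boundaries, and unannotated genes.

```python
def parse_window_range(text):
    """Parse an 'x:y' window range into an inclusive list of window sizes."""
    low, high = (int(part) for part in text.split(":"))
    if low < 1 or high < low:
        raise ValueError(f"invalid window range: {text!r}")
    return list(range(low, high + 1))


def ngrams(annotations, window_sizes):
    """Yield every run of consecutive annotations for each window size."""
    for size in window_sizes:
        for start in range(len(annotations) - size + 1):
            yield tuple(annotations[start:start + size])


# Hypothetical gene order on one contig, labelled by annotation.
contig = ["COG1", "COG2", "COG3", "COG4"]
sizes = parse_window_range("2:3")
found = list(ngrams(contig, sizes))

print(sizes)
for gram in found:
    print(gram)
```

With four genes and window sizes 2 and 3, the sketch reports three Ngrams of size 2 and two of size 3.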
A program to cluster items in a merged anvi'o profile using automatic binning algorithms.
profile_db
clustering
collections
Usage
anvi-cluster-contigs [-h] -p PROFILE_DB -c CONTIGS_DB -C
COLLECTION_NAME --driver DRIVER [-T NUM_THREADS]
[--just-do-it]
Parameters
optional arguments:
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
--driver DRIVER Automatic binning drivers. Available options 'concoct,
metabat2, maxbin2, dastool, binsanity'.
-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on a SGE
--if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.
--just-do-it Don't bother me with questions or warnings, just do
it.
CONCOCT
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
--clusters INT specify maximal number of clusters for VGMM, default
400
--kmer-length INT specify kmer length, default 4
--length-threshold INT
specify the sequence length threshold, contigs shorter
than this value will not be included. Defaults to 1000
--read-length INT specify read length for coverage, default 100
--no-cov-normalization
By default the coverage is normalized with regard to
samples, then normalized with regard to contigs and
finally log transformed. By setting this flag you skip
the normalization and only do a log transform of the
coverage.
--total-percentage-pca INT
The percentage of variance explained by the principal
components for the combined data.
--epsilon FLOAT Specify the epsilon for VBGMM. Default value is 1.0e-6
--iterations INT Specify maximum number of iterations for the VBGMM.
Default value is 500
--seed INT Specify an integer to use as seed for clustering. 0
gives a random seed, 1 is the default seed and any
other positive integer can be used. Other values give
ArgumentTypeError.
METABAT2 [NOT FOUND]
MAXBIN2 [NOT FOUND]
DASTOOL
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
-S COLLECTION_LIST, --source-collections COLLECTION_LIST
Comma-separated list of collections, case sensitive.
--search-engine PROGRAM
Engine used for single copy gene identification
[blast/diamond/usearch]. (default: usearch)
--score-threshold FLOAT
Score threshold until selection algorithm will keep
selecting bins [0..1]. (default: 0.5)
--duplicate-penalty FLOAT
Penalty for duplicate single copy genes per bin
(weight b). Only change if you know what you're doing.
[0..3] (default: 0.6)
--megabin-penalty FLOAT
Penalty for megabins (weight c). Only change if you
know what you're doing. [0..3] (default: 0.5)
--db-directory PATH Directory of single copy gene database. (default:
install_dir/db)
BINSANITY
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
--preference INT Specify a preference (default is -3) Note: decreasing the
preference leads to more lumping, increasing will lead to
more splitting. If your range of coverages are low you
will want to decrease the preference, if you have 10 or
less replicates increasing the preference could benefit
you.
--maxiter INT Specify a max number of iterations [default is 2000]
--conviter INT Specify the convergence iteration number (default is 200)
e.g Number of iterations with no change in the number of
estimated clusters that stops the convergence.
--damp FLOAT Specify a damping factor between 0.5 and 1, default is 0.9
--contigsize INT TNF probability cutoff for building TNF graph. Use it to
skip the preparation step. (0: auto).
A script to generate completeness info for a given list of splits
Usage
anvi-compute-completeness [-h] [--splits-of-interest FILE] -c
CONTIGS_DB [-e E-VALUE]
[--list-completeness-sources]
[--completeness-source NAME]
Parameters
optional arguments:
--splits-of-interest FILE
A file with split names. There should be only one
column in the file, and each line should correspond to
a unique split name.
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-e E-VALUE, --min-e-value E-VALUE
Minimum significance score for an HMM hit to be
considered valid. Default is 1e-15.
--list-completeness-sources
Show available sources and exit.
--completeness-source NAME
Single-copy gene source to use to estimate
completeness.
Compute homogeneity for gene clusters
Usage
anvi-compute-gene-cluster-homogeneity [-h] -p PAN_DB
[-g GENOMES_STORAGE]
[-o FILE_PATH] [--store-in-db]
[--gene-cluster-id GENE_CLUSTER_ID]
[--gene-cluster-ids-file FILE_PATH]
[-C COLLECTION_NAME]
[-b BIN_NAME]
[--quick-homogeneity]
[-T NUM_THREADS] [--just-do-it]
Parameters
INPUT FILES: Input files from the pangenome analysis.
-p PAN_DB, --pan-db PAN_DB
Anvi'o pan database
-g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
Anvi'o genomes storage file
REPORTING: How do you want results to be reported? Anvi'o can produce a TAB-delimited output file for you (for which you would have to provide an output file name). Or the results can be stored in the pan database directly, for which you would have to explicitly ask for it. You can get both as well in case you are a fan of redundancy and poor data analysis practices. Anvi'o does not judge.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
--store-in-db Store analysis results into the database directly.
SELECTION: Which gene clusters should be analyzed. You can ask for a single gene cluster, or multiple ones listed in a file, or you can use a collection and bin name to list gene clusters of interest.
--gene-cluster-id GENE_CLUSTER_ID
Gene cluster ID you are interested in.
--gene-cluster-ids-file FILE_PATH
Text file for gene clusters (each line should contain
a unique gene cluster id).
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.
OPTIONAL: Optional stuff available for you to use
--quick-homogeneity By default, anvi'o will use a homogeneity algorithm
that checks for horizontal and vertical geometric
homogeneity (along with functional). With this flag,
you can tell anvi'o to skip horizontal geometric
homogeneity calculations. It will be less accurate but
quicker.
-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on a SGE
--if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.
--just-do-it Don't bother me with questions or warnings, just do
it.
Export sequences from sequence sources and compute a similarity metric (e.g. ANI). If a Pan Database is given anvi'o will write computed output to misc data tables of Pan Database.
ani
dereplication
redundancy
Usage
anvi-compute-genome-similarity [-h] [-i FILE_PATH] [-e FILE_PATH]
[-f FASTA_TEXT_FILE] -o DIR_PATH
[-p PAN_DB]
[--program {pyANI,fastANI,sourmash}]
[--fastani-kmer-size FASTANI_KMER_SIZE]
[--fragment-length FRAGMENT_LENGTH]
[--min-num-fragments MIN_NUM_FRAGMENTS]
[--method {ANIm,ANIb,ANIblastall,TETRA}]
[--min-alignment-fraction NUM]
[--significant-alignment-length INT]
[--min-full-percent-identity FULL_PERCENT_IDENTITY]
[--kmer-size INT] [--scale INT]
[--distance DISTANCE_METRIC]
[--linkage LINKAGE_METHOD]
[-T NUM_THREADS] [--just-do-it]
[--log-file FILE_PATH]
Parameters
INPUT OPTIONS: Tell anvi'o what you want.
-i FILE_PATH, --internal-genomes FILE_PATH
A five-column TAB-delimited flat text file. The header
line must contain these columns: 'name', 'bin_id',
'collection_id', 'profile_db_path', 'contigs_db_path'.
Each line should list a single entry, where 'name' can
be any name to describe the anvi'o bin identified as
'bin_id' that is stored in a collection.
-e FILE_PATH, --external-genomes FILE_PATH
A two-column TAB-delimited flat text file that lists
anvi'o contigs databases. The first item in the header
line should read 'name', and the second should read
'contigs_db_path'. Each line in the file should
describe a single entry, where the first column is the
name of the genome (or MAG), and the second column is
the anvi'o contigs database generated for this genome.
-f FASTA_TEXT_FILE, --fasta-text-file FASTA_TEXT_FILE
A two-column TAB-delimited file that lists multiple
FASTA files to import for analysis. If using for
`anvi-dereplicate-genomes` or `anvi-compute-distance`,
each FASTA is assumed to be a genome. The first item
in the header line should read 'name', and the second
item should read 'path'. Each line in the file should
describe a single entry, where the first column is the
name of the FASTA file or corresponding sequence, and
the second column is the path to the FASTA file
itself.
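As an illustration of the two-column layout described for `--external-genomes`, the following Python sketch writes such a file with the standard `csv` module. The genome names and database paths are hypothetical placeholders, and `write_external_genomes` is a helper invented for this example, not part of anvi'o.

```python
import csv
import io


def write_external_genomes(rows, handle):
    """Write a TAB-delimited file with 'name' and 'contigs_db_path' columns."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["name", "contigs_db_path"])
    for name, path in rows:
        writer.writerow([name, path])


# Hypothetical genomes and contigs database paths.
genomes = [
    ("genome_A", "genome_A-contigs.db"),
    ("genome_B", "genome_B-contigs.db"),
]

# An in-memory buffer stands in for a real file such as external-genomes.txt.
buffer = io.StringIO()
write_external_genomes(genomes, buffer)
external_genomes_text = buffer.getvalue()
print(external_genomes_text, end="")
```

To produce a real file, the same function can be called with a handle from `open("external-genomes.txt", "w", newline="")`.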
OUTPUT OPTIONS: Tell anvi'o where to store your results.
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files
-p PAN_DB, --pan-db PAN_DB
This is totally optional, but very useful when
applicable. If you are running this for genomes for
which you already have an anvi'o pangenome, then you
can show where the pan database is and anvi'o would
automatically add the results into the misc data
tables of your pangenome. Those data can then be shown
as heatmaps on the pan interactive interface through
the 'layers' tab.
Program: Tell anvi'o which similarity program to run.
--program {pyANI,fastANI,sourmash}
Tell anvi'o which program to run to process genome
similarity. For ANI, you should either use pyANI or
fastANI. If accuracy is paramount (for example,
distinguishing things less than 1 percent different),
or for dealing with genomes < 80 percent similar,
pyANI is what we recommend. However, fastANI is much
faster. If you for some reason want to use mash
similarity, you can use sourmash, but it's really not
intended for genome comparisons. If you don't choose
anything here, anvi'o will reluctantly set the program
to pyANI, but you really should be the one who is on
top of these things.
fastANI Settings: Tell anvi'o to tell fastANI what settings to set. Only if --program is set to fastANI
--fastani-kmer-size FASTANI_KMER_SIZE
Choose a kmer. The default is 16.
--fragment-length FRAGMENT_LENGTH
Choose a fragment length. The default is 3000.
--min-num-fragments MIN_NUM_FRAGMENTS
Choose the minimum number of fragments that can be
trusted. The default is 50.
pyANI Settings: Tell anvi'o to tell pyANI what method you wish to use and what settings to set. Only if --program is set to pyANI
--method {ANIm,ANIb,ANIblastall,TETRA}
Method for pyANI. The default is ANIb. You must have
the necessary binary in path for whichever method you
choose. According to the pyANI help for v0.2.7 at
https://github.com/widdowquinn/pyani, the method
'ANIm' uses MUMmer (NUCmer) to align the input
sequences. 'ANIb' uses BLASTN+ to align 1020nt
fragments of the input sequences. 'ANIblastall' uses
the legacy BLASTN to align 1020nt fragments. Finally,
'TETRA' calculates tetranucleotide frequencies of
each input sequence.
--min-alignment-fraction NUM
In some cases you may get high raw ANI estimates
(percent identity scores) between two genomes that
have little to do with each other simply because only
a small fraction of their content may be aligned. This
filter will set all ANI scores between two genomes to
0 if the alignment fraction is less than you deem
trustable. When you set a value, anvi'o will go
through the ANI results, and set percent identity
scores between two genomes to 0 if the alignment
fraction *between either of them* is less than the
parameter described here. The default is 0.
--significant-alignment-length INT
So --min-alignment-fraction discards any hit that is
coming from alignments that represent shorter
fractions of genomes, but what if you still don't want
to miss an alignment that is longer than an X number
of nucleotides regardless of what fraction of the
genome it represents? Well, this parameter is to
recover things that may be lost due to --min-
alignment-fraction parameter. Let's say, if you set
--min-alignment-fraction to '0.05', and this parameter
to '5000', anvi'o will keep hits from alignments that
are longer than 5000 nts, EVEN IF THEY REPRESENT less
than 5 percent of a given genome pair. Basically if
--min-alignment-fraction is your shield to protect
yourself from incoming garbage, --significant-
alignment-length is your chopstick to pick out those
that may be interesting, and you are a true warrior
here.
--min-full-percent-identity FULL_PERCENT_IDENTITY
In some cases you may get high raw ANI estimates
(percent identity scores) between two genomes that
have little to do with each other simply because only
a small fraction of their content may be aligned. This
can be partly alleviated by considering the *full*
percent identity, which includes in its calculation
regions that did not align. For example, if the
alignment is a whopping 97 percent identity but only 8
percent of the genome aligned, the *full* percent
identity is 0.970 * 0.080 = 0.078 OR 7.8 percent.
*full* percent identity is always included in the
report, but you can also use it as a filter for other
metrics, such as percent identity. This filter will
set all ANI measures between two genomes to 0 if the
*full* percent identity is less than you deem
trustable. When you set a value, anvi'o will go
through the ANI results, and set all ANI measures
between two genomes to 0 if the *full* percent
identity *between either of them* is less than the
parameter described here. The default is 0.
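The interplay between `--min-alignment-fraction` and `--significant-alignment-length` described above can be sketched in a few lines of Python. This is a simplified reading of the help text, not anvi'o's code: `filter_percent_identity` and its arguments are hypothetical names, and the numbers are made up for the example.

```python
def filter_percent_identity(percent_identity, alignment_fractions, alignment_length,
                            min_alignment_fraction=0.0,
                            significant_alignment_length=None):
    """Return the percent identity, or 0 if the pair fails the fraction filter.

    alignment_fractions holds the aligned fraction of each of the two genomes;
    the pair fails if either fraction is below the minimum, unless the
    alignment is longer than significant_alignment_length.
    """
    too_small = min(alignment_fractions) < min_alignment_fraction
    rescued = (significant_alignment_length is not None
               and alignment_length > significant_alignment_length)
    if too_small and not rescued:
        return 0.0
    return percent_identity


# One genome aligns over only 3 percent of its length: the score is zeroed.
print(filter_percent_identity(0.97, (0.03, 0.40), 2000, 0.05))

# Same fractions, but the alignment is longer than 5000 nts: the score is kept.
print(filter_percent_identity(0.97, (0.03, 0.40), 8000, 0.05, 5000))
```

The first call prints 0.0 and the second prints 0.97, mirroring the "shield" and "chopstick" roles the two parameters play.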
Sourmash Settings: Tell anvi'o to tell sourmash what settings to set. Only if --program is set to sourmash
--kmer-size INT Set the k-mer size for mash similarity checks. We
found 13 in almost all cases correlates best with
alignment-based ANI.
--scale INT Set the compression ratio for fasta signature file
computations. The default is 1000. Smaller ratios
decrease sensitivity, while larger ratios will lead to
large fasta signatures.
HIERARCHICAL CLUSTERING: anvi-compute-genome-similarity outputs similarity matrix files, which can be clustered into nice looking dendrograms to display the relationships between genomes nicely (in the anvi'o interface and elsewhere). Here you can set the distance metric and the linkage algorithm for that.
--distance DISTANCE_METRIC
The distance metric for the hierarchical clustering.
The default is "euclidean".
--linkage LINKAGE_METHOD
The linkage method for the hierarchical clustering.
The default is "ward".
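For readers unsure what a "euclidean" distance means for a similarity matrix, the sketch below treats each genome's row of similarity values as a vector and computes pairwise distances between rows. This is an assumption about how such a matrix is typically clustered, not a description of anvi'o internals; the matrix values and the `euclidean_distances` helper are hypothetical. A linkage method such as "ward" would then build the dendrogram from distances like these.

```python
import math

# Hypothetical symmetric similarity matrix (rows and columns in A, B, C order).
similarity = {
    "genome_A": [1.00, 0.98, 0.80],
    "genome_B": [0.98, 1.00, 0.81],
    "genome_C": [0.80, 0.81, 1.00],
}


def euclidean_distances(matrix):
    """Return the euclidean distance between every pair of rows."""
    names = sorted(matrix)
    return {
        (a, b): math.dist(matrix[a], matrix[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


distances = euclidean_distances(similarity)
closest = min(distances, key=distances.get)

for pair, value in distances.items():
    print(pair, round(value, 4))
print("closest pair:", closest)
```

Here genome_A and genome_B have nearly identical rows, so they end up as the closest pair and would be joined first in a dendrogram.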
OTHER IMPORTANT STUFF: Yes. You're almost done.
-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on a SGE
--if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.
--just-do-it Don't bother me with questions or warnings, just do
it.
--log-file FILE_PATH File path to store debug/output messages.
Access self tables, display values, or set new ones entirely at your own risk.
Usage
anvi-db-info [-h] [--self-key SELF_KEY] [--self-value SELF_VALUE]
[--just-do-it]
DATABASE_PATH
Parameters
Input: The database path you wish to access.
DATABASE_PATH An anvi'o database for pan, profile, contigs, or
auxiliary data
Very dangerous zone: For power users with extreme self-control and maturity.
--self-key SELF_KEY The key you wish to set or change.
--self-value SELF_VALUE
The value you wish to set for the self key.
--just-do-it Don't bother me with questions or warnings, just do
it.
Remove a collection from a given profile database.
Usage
anvi-delete-collection [-h] -p PROFILE_DB [-C COLLECTION_NAME]
[--list-collections]
Parameters
optional arguments:
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
--list-collections Show available collections and exit.
Remove HMM hits from an anvi'o contigs database.
Usage
anvi-delete-hmms [-h] -c CONTIGS_DB [--hmm-source SOURCE NAME] [-l]
[--just-do-it]
Parameters
optional arguments:
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
--hmm-source SOURCE NAME
Use a specific HMM source. You can use '--list-hmm-
sources' flag to see a list of available resources.
The default is 'None'.
-l, --list-hmm-sources
List available HMM sources in the contigs database and
quit.
--just-do-it Don't bother me with questions or warnings, just do
it.
Remove stuff from 'additional data' or 'order' tables for either items or layers in either pan or profile databases. OR, remove stuff from the 'additional data' tables for nucleotides or amino acids in contigs databases.
Usage
anvi-delete-misc-data [-h] [-p PAN_OR_PROFILE_DB] [-c CONTIGS_DB] -t
NAME [--keys-to-remove KEYS_TO_REMOVE]
[--groups-to-remove GROUPS_TO_REMOVE]
[--list-available-keys] [--just-do-it]
Parameters
Database input: Provide 1 of these
-p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database (and even genes
database in appropriate contexts).
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
Details: Everything else.
-t NAME, --target-data-table NAME
The target table is the table you are interested in
accessing. Currently it can be 'items', 'layers', or
'layer_orders'. Please see most up-to-date online
documentation for more information.
--keys-to-remove KEYS_TO_REMOVE
A comma-separated list of data keys to remove from the
database. If you do not use this parameter, anvi'o
will simply remove everything from the target data
table immediately. Please note that you should not use
this parameter together with `--groups-to-remove` in a
single command.
--groups-to-remove GROUPS_TO_REMOVE
A comma-separated list of data groups to remove from
the database. If you do not use this parameter, anvi'o
will simply remove everything from the target data
table immediately. Please note that you should not use
this parameter together with `--keys-to-remove` in a
single command.
--list-available-keys
Using this flag will list available data keys in the
target data table and quit without doing anything
else.
--just-do-it Don't bother me with questions or warnings, just do
it.
Delete an anvi'o state from a pan or profile database.
Usage
anvi-delete-state [-h] -p PAN_OR_PROFILE_DB [-s STATE_NAME]
[--list-states]
Parameters
optional arguments:
-p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database (and even genes
database in appropriate contexts).
-s STATE_NAME, --state STATE_NAME
The state name to ... delete :(
--list-states Show available states and exit.
Identify redundant (highly similar) genomes.
Usage
anvi-dereplicate-genomes [-h] [-i FILE_PATH] [-e FILE_PATH]
[-f FASTA_TEXT_FILE] [--ani-dir PATH]
[--mash-dir PATH] -o DIR_PATH
[--skip-fasta-report] [--report-all]
[--program {pyANI,fastANI,sourmash}]
[--fastani-kmer-size FASTANI_KMER_SIZE]
[--fragment-length FRAGMENT_LENGTH]
[--min-fraction MIN_FRACTION]
[--method {ANIm,ANIb,ANIblastall,TETRA}]
[--min-alignment-fraction NUM]
[--significant-alignment-length INT]
[--use-full-percent-identity]
[--min-full-percent-identity FULL_PERCENT_IDENTITY]
[--kmer-size INT] [--scale INT]
--similarity-threshold SIMILARITY_THRESHOLD
[--cluster-method {simple_greedy}]
[--representative-method {Qscore,length,centrality}]
[-T NUM_THREADS] [--just-do-it]
[--skip-checking-genome-hashes]
[--log-file FILE_PATH]
Parameters
INPUT OPTIONS: Tell anvi'o what you want.
-i FILE_PATH, --internal-genomes FILE_PATH
A five-column TAB-delimited flat text file. The header
line must contain these columns: 'name', 'bin_id',
'collection_id', 'profile_db_path', 'contigs_db_path'.
Each line should list a single entry, where 'name' can
be any name to describe the anvi'o bin identified as
'bin_id' that is stored in a collection.
-e FILE_PATH, --external-genomes FILE_PATH
A two-column TAB-delimited flat text file that lists
anvi'o contigs databases. The first item in the header
line should read 'name', and the second should read
'contigs_db_path'. Each line in the file should
describe a single entry, where the first column is the
name of the genome (or MAG), and the second column is
the anvi'o contigs database generated for this genome.
-f FASTA_TEXT_FILE, --fasta-text-file FASTA_TEXT_FILE
A two-column TAB-delimited file that lists multiple
FASTA files to import for analysis. If using for
`anvi-dereplicate-genomes` or `anvi-compute-distance`,
each FASTA is assumed to be a genome. The first item
in the header line should read 'name', and the second
item should read 'path'. Each line in the file should
describe a single entry, where the first column is the
name of the FASTA file or corresponding sequence, and
the second column is the path to the FASTA file
itself.
IMPORT RESULTS: Alternatively, if you have previous ANI or mash similarity computations on your genomes, you can import the result directory here to use. Please note that file names must remain unchanged for anvi'o to find them.
--ani-dir PATH You can import the directory created by `anvi-compute-
genome-similarity` if `--program` parameter was set to
`fastANI` or `pyANI` and use it for dereplication
--mash-dir PATH You can import the directory created by `anvi-compute-
genome-similarity` if `--program` parameter was set to
`sourmash` and use it for dereplication
OUTPUT OPTIONS: Tell anvi'o where to store your results.
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files
--skip-fasta-report By default, if any sequence source is provided, FASTA
files of non-redundant genomes are reported. With this
flag, no FASTA files are reported.
--report-all By default, only FASTA files of non-redundant genomes
are reported, i.e. single representatives from each
cluster. With this flag, all genome FASTAS will be
reported.
Program: Tell anvi'o which similarity program to run.
--program {pyANI,fastANI,sourmash}
Tell anvi'o which program to run to process genome
similarity. For ANI, you can either use pyANI or
fastANI. If accuracy is paramount (for example,
distinguishing things less than 1 percent different),
or for dealing with genomes < 80 percent similar,
pyANI is what we recommend. However, fastANI is much
faster. If you for some reason want to use mash
similarity, you can use sourmash, but it's really not
intended for genome comparisons.
fastANI Settings: Tell anvi'o to tell fastANI what settings to set. Only if --program is set to fastANI
--fastani-kmer-size FASTANI_KMER_SIZE
Choose a kmer. The default is 16.
--fragment-length FRAGMENT_LENGTH
Choose a fragment length. The default is 3000.
--min-fraction MIN_FRACTION
Minimum fraction of alignment to be shared between
genome pairs to calculate ANI. If reference and query
genome size differ, smaller one among the two is
considered. The default is 0.25.
pyANI Settings: Tell anvi'o to tell pyANI what method you wish to use and what settings to set. Only if --program is set to pyANI
--method {ANIm,ANIb,ANIblastall,TETRA}
Method for pyANI. The default is ANIb. You must have
the necessary binary in path for whichever method you
choose. According to the pyANI help for v0.2.7 at
https://github.com/widdowquinn/pyani, the method
'ANIm' uses MUMmer (NUCmer) to align the input
sequences. 'ANIb' uses BLASTN+ to align 1020nt
fragments of the input sequences. 'ANIblastall' uses
the legacy BLASTN to align 1020nt fragments. Finally,
'TETRA' calculates tetranucleotide frequencies of
each input sequence.
--min-alignment-fraction NUM
In some cases you may get high raw ANI estimates
(percent identity scores) between two genomes that
have little to do with each other simply because only
a small fraction of their content may be aligned. This
filter will set all ANI scores between two genomes to
0 if the alignment fraction is less than you deem
trustable. When you set a value, anvi'o will go
through the ANI results, and set percent identity
scores between two genomes to 0 if the alignment
fraction *between either of them* is less than the
parameter described here. The default is 0.25.
--significant-alignment-length INT
So --min-alignment-fraction discards any hit that is
coming from alignments that represent shorter
fractions of genomes, but what if you still don't want
to miss an alignment that is longer than an X number
of nucleotides regardless of what fraction of the
genome it represents? Well, this parameter is to
recover things that may be lost due to --min-
alignment-fraction parameter. Let's say, if you set
--min-alignment-fraction to '0.05', and this parameter
to '5000', anvi'o will keep hits from alignments that
are longer than 5000 nts, EVEN IF THEY REPRESENT less
than 5 percent of a given genome pair. Basically if
--min-alignment-fraction is your shield to protect
yourself from incoming garbage, --significant-
alignment-length is your chopstick to pick out those
that may be interesting, and you are a true warrior
here.
--use-full-percent-identity
Usually, percent identity is calculated only over
aligned regions, and this is what is used as a
distance metric by default. But with this flag, you
can instead use the *full* percent identity as the
distance metric. It is the same as percent identity,
except that regions that did not align are included in
the calculation. This means *full* percent identity
will always be less than or equal to percent identity.
How is it calculated? Well, if P is the percent
identity calculated in aligned regions, L is the
length of the genome, and A is the length of the
genome that aligned to a compared genome, the full
percent identity is P * (A/L). In other words, it is
the percent identity multiplied by the alignment
coverage. For example, if the alignment is a whopping
97 percent identity but only 8 percent of the genome
aligned, the *full* percent identity is 0.970 * 0.080
= 0.078, which is just 7.8 percent.
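The arithmetic above can be checked with a short Python sketch. The function name and the genome sizes are made up for the example; the only thing taken from the help text is that full percent identity equals percent identity multiplied by alignment coverage.

```python
def full_percent_identity(percent_identity, aligned_length, genome_length):
    """Percent identity scaled by alignment coverage (hypothetical helper)."""
    coverage = aligned_length / genome_length
    return percent_identity * coverage


# 97 percent identity over 80,000 aligned nts of a 1,000,000 nt genome,
# i.e. 8 percent coverage (illustrative numbers only).
value = full_percent_identity(0.97, 80_000, 1_000_000)
print(round(value, 3))  # 0.078
```

Because coverage can never exceed 1, the result can never exceed the percent identity it started from.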
--min-full-percent-identity FULL_PERCENT_IDENTITY
In some cases you may get high raw ANI estimates
(percent identity scores) between two genomes that
have little to do with each other simply because only
a small fraction of their content may be aligned. This
can be partly alleviated by considering the *full*
percent identity, which includes in its calculation
regions that did not align. For example, if the
alignment is a whopping 97 percent identity but only 8
percent of the genome aligned, the *full* percent
identity is 0.970 * 0.080 = 0.078 OR 7.8 percent.
*full* percent identity is always included in the
report, but you can also use it as a filter for other
metrics, such as percent identity. This filter will
set all ANI measures between two genomes to 0 if the
*full* percent identity is less than you deem
trustable. When you set a value, anvi'o will go
through the ANI results, and set all ANI measures
between two genomes to 0 if the *full* percent
identity *between either of them* is less than the
parameter described here. The default is 20.
sourmash settings: Tell anvi'o to run sourmash with specific settings. Only if --program is set to sourmash
--kmer-size INT Set the k-mer size for mash similarity checks. The
default is 13.
--scale INT Set the compression ratio for fasta signature file
computations. The default is 1000. Smaller ratios
decrease sensitivity, while larger ratios will lead to
large fasta signatures.
Dereplication Parameters: Some parameters to guide your dereplication
--similarity-threshold SIMILARITY_THRESHOLD
If two genomes have a similarity greater than or equal
to this threshold, they will belong to the same
cluster. Since measures of 'similarity' depend
strongly on what method is used for calculation, and
since the threshold at which two genomes should be
considered 'similar enough' to be considered redundant
will depend on the application, anvi'o refuses to
provide a default parameter. If you're using pyANI,
maybe 0.90 is what you're after. If you're using
sourmash, maybe 0.25 is what you're after. Or maybe
not? Anvi'o is feeling nervous about this decision.
--cluster-method {simple_greedy}
Currently, genomes are clustered based on a simple
greedy algorithm. Let's say your similarity threshold
is 0.90. If genome A is 0.95 similar to B, and B is
0.95 similar to C, and C is 0.95 similar to D, then
{A,B,C,D} will form a cluster. This is *even though* D
may share a similarity to A of merely 0.80, which is
below similarity threshold. You want better
alternatives? Contact the developers.
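The chaining behavior in the A, B, C, D example can be reproduced with a small Python sketch. It assumes that "simple greedy" amounts to merging any two genomes whose similarity meets the threshold, which matches the example above; whether anvi'o implements it this way internally is an assumption, and every name below is hypothetical.

```python
def cluster_by_threshold(similarities, threshold):
    """Group genomes that are linked by any chain of pairs at or above threshold.

    similarities: dict mapping (genome_a, genome_b) -> similarity score.
    """
    parent = {}

    def find(genome):
        # Union-find lookup with path halving.
        parent.setdefault(genome, genome)
        while parent[genome] != genome:
            parent[genome] = parent[parent[genome]]
            genome = parent[genome]
        return genome

    for (a, b), similarity in similarities.items():
        root_a, root_b = find(a), find(b)
        if similarity >= threshold and root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for genome in list(parent):
        clusters.setdefault(find(genome), set()).add(genome)
    return sorted(sorted(members) for members in clusters.values())


similarities = {("A", "B"): 0.95, ("B", "C"): 0.95,
                ("C", "D"): 0.95, ("A", "D"): 0.80}
print(cluster_by_threshold(similarities, 0.90))  # [['A', 'B', 'C', 'D']]
```

A and D end up together even though their direct similarity is 0.80, because each link in the chain A-B-C-D clears the threshold.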
--representative-method {Qscore,length,centrality}
After genomes are grouped into redundancy clusters,
you can define how anvi'o picks the representative
genome from the cluster. 'Qscore' computes the genome
with the highest completion and lowest redundancy as
the representative. 'length' returns the longest
genome. 'centrality' returns the genome with the
highest average similarity to everything in the
cluster, i.e. the most central. The default is
centrality
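As a rough illustration of the 'centrality' option, the Python sketch below picks the genome with the highest average similarity to the other members of its cluster. The function name, the data layout, and the choice to leave self-similarity out of the average are all assumptions made for the example, not a description of anvi'o's code.

```python
def most_central_genome(cluster, similarities):
    """Return the member with the highest mean similarity to the others.

    cluster: list of genome names.
    similarities: dict mapping frozenset({a, b}) -> similarity score.
    """
    def average_similarity(genome):
        others = [other for other in cluster if other != genome]
        if not others:
            return 0.0
        total = sum(similarities[frozenset((genome, other))] for other in others)
        return total / len(others)

    return max(cluster, key=average_similarity)


similarities = {
    frozenset(("A", "B")): 0.95,
    frozenset(("B", "C")): 0.95,
    frozenset(("A", "C")): 0.91,
}
print(most_central_genome(["A", "B", "C"], similarities))  # B
```

B wins because it is close to both A and C, while A and C are farther from each other.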
OTHER IMPORTANT STUFF: Yes. You're almost done.
-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on a SGE
--if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.
--just-do-it Don't bother me with questions or warnings, just do
it.
--skip-checking-genome-hashes
Use this flag if you would like anvi'o to skip
checking genome hashes. This is only relevant if you
may have genomes in your internal or external genomes
files that have identical sequences with different
names AND if you are OK with it. You may be OK with
it, for instance, if you are using the `anvi-dereplicate-genomes`
program to dereplicate genomes described in
multiple collections in an anvi'o profile database
that may be describing the same genome multiple times
(see https://github.com/merenlab/anvio/issues/1397 for
a case).
--log-file FILE_PATH File path to store debug/output messages.
Start the anvi'o interactive interface for viewing or comparing contigs statistics
Usage
anvi-display-contigs-stats [-h] [--report-as-text] [-o FILE_PATH]
[-I IP_ADDR] [-P INT] [--browser-path PATH]
[--server-only] [--password-protected]
CONTIG DATABASE(S) [CONTIG DATABASE(S) ...]
Parameters
positional arguments:
CONTIG DATABASE(S) Anvi'o contigs databases to display statistics for. You
can give multiple databases by separating them with
spaces.
REPORT CONFIGURATION: Specify what kind of output you want.
--report-as-text If you give this flag, anvi'o will not open a
browser to show contigs database statistics. Instead,
it will write all stats to a TAB-separated file. You
must also give --output-file with this flag, otherwise
anvi'o will complain.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
SERVER CONFIGURATION: For power users.
-I IP_ADDR, --ip-address IP_ADDR
IP address for the HTTP server. The default ip address
(0.0.0.0) should work just fine for most.
-P INT, --port-number INT
Port number to use for anvi'o services. If nothing is
declared, anvi'o will try to find a suitable port
number, starting from the default port number, 8080.
--browser-path PATH By default, anvi'o will use your default browser to
launch the interactive interface. If you would like to
use something else than your system default, you can
provide a full path for an alternative browser using
this parameter, and hope for the best. For instance we
are using this parameter to call Google's experimental
browser, Canary, which performs better with demanding
visualizations.
--server-only The default behavior is to start the local server, and
fire up a browser that connects to the server. If you
have other plans, and want to start the server without
calling the browser, this is the flag you need.
--password-protected If this flag is set, command line tool will ask you to
enter a password and interactive interface will be
only accessible after entering same password. This
option is recommended for shared machines like
clusters or shared networks where computers are not
isolated.
Start the anvi'o interactive interface for viewing KEGG metabolism data
Usage
anvi-display-metabolism [-h] -c CONTIGS_DB [-m]
[--kegg-data-dir KEGG_DATA_DIR] [-p PROFILE_DB]
[-C COLLECTION_NAME] [-b BIN_NAME]
[-B FILE_PATH]
[--module-completion-threshold NUM]
[-I IP_ADDR] [-P INT] [--browser-path PATH]
[--server-only] [--password-protected]
Parameters
INPUT: The minimum you must provide this program is a contigs database. In which
case anvi'o will attempt to estimate and display metabolism for all
contigs in it, assuming that the contigs database represents a single
genome. If the contigs database is actually a metagenome, you should use
the --metagenome-mode flag to explicitly declare that.
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-m, --metagenome-mode
Treat a given contigs database as a metagenome rather
than treating it as a single genome.
--kegg-data-dir KEGG_DATA_DIR
The directory path for your KEGG setup, which will
include things like KOfam profiles and KEGG MODULE
data. Anvi'o will try to use the default path if you
do not specify anything.
ADDITIONAL INPUT: If you also provide a profile database AND a collection name, anvi'o will
estimate metabolism separately for each bin in your collection. You can
also limit those estimates to a specific bin or set of bins in the
collection using the parameters --bin-id or --bin-ids-file, respectively.
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.
-B FILE_PATH, --bin-ids-file FILE_PATH
Text file for bins (each line should be a unique bin
id).
OUTPUT: Parameters for controlling estimation output. The output will be TAB-delimited files which by default are prefixed with 'kegg-metabolism', but you can of course change that name here.
--module-completion-threshold NUM
This threshold defines the point at which we consider
a KEGG module to be 'complete' or 'present' in a given
genome or bin. It is the fraction of steps that must
be complete in order for the entire module to be
marked complete. The default is 0.75.
SERVER CONFIGURATION: For power users.
-I IP_ADDR, --ip-address IP_ADDR
IP address for the HTTP server. The default ip address
(0.0.0.0) should work just fine for most.
-P INT, --port-number INT
Port number to use for anvi'o services. If nothing is
declared, anvi'o will try to find a suitable port
number, starting from the default port number, 8080.
--browser-path PATH By default, anvi'o will use your default browser to
launch the interactive interface. If you would like to
use something else than your system default, you can
provide a full path for an alternative browser using
this parameter, and hope for the best. For instance we
are using this parameter to call Google's experimental
browser, Canary, which performs better with demanding
visualizations.
--server-only The default behavior is to start the local server, and
fire up a browser that connects to the server. If you
have other plans, and want to start the server without
calling the browser, this is the flag you need.
--password-protected If this flag is set, command line tool will ask you to
enter a password and interactive interface will be
only accessible after entering same password. This
option is recommended for shared machines like
clusters or shared networks where computers are not
isolated.
Start an anvi'o server to display a pan-genome
Usage
anvi-display-pan [-h] -p PAN_DB [-g GENOMES_STORAGE] [-d VIEW_DATA]
[-t NEWICK] [-V ADDITIONAL_VIEW]
[-A ADDITIONAL_LAYERS] [--view NAME] [--title NAME]
[--state-autoload NAME] [--collection-autoload NAME]
[--export-svg FILE_PATH] [--skip-init-functions]
[--dry-run] [--skip-auto-ordering] [-I IP_ADDR]
[-P INT] [--browser-path PATH] [--read-only]
[--server-only] [--password-protected]
[--user-server-shutdown]
Parameters
INPUT FILES: Input files from the pangenome analysis.
-p PAN_DB, --pan-db PAN_DB
Anvi'o pan database
-g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
Anvi'o genomes storage file
OPTIONAL INPUTS: Where the yay factor becomes a reality.
-d VIEW_DATA, --view-data VIEW_DATA
A TAB-delimited file for view data
-t NEWICK, --tree NEWICK
NEWICK formatted tree structure
ADDITIONAL STUFF: Parameters to provide additional layers, views, or layer data.
-V ADDITIONAL_VIEW, --additional-view ADDITIONAL_VIEW
A TAB-delimited file for an additional view to be used
in the interface. This file should contain all split
names, and values for each of them in all samples.
Each column in this file must correspond to a sample
name. Content of this file will be called 'user_view',
which will be available as a new item in the 'views'
combo box in the interface
-A ADDITIONAL_LAYERS, --additional-layers ADDITIONAL_LAYERS
A TAB-delimited file for additional layers for splits.
The first column of this file must be split names, and
the remaining columns should be unique attributes. The
file does not need to contain all split names, or
values for each split in every column. Anvi'o will try
to deal with missing data nicely. Each column in this
file will be visualized as a new layer in the tree.
VISUALS RELATED: Parameters that give access to various adjustments regarding the interface.
--view NAME Start the interface with a pre-selected view. To see a
list of available views, use --show-views flag.
--title NAME Title for the interface. If you are working with a
RUNINFO dict, the title will be determined based on
information stored in that file. Regardless, you can
override that value using this parameter.
--state-autoload NAME
Automatically load a previously saved state and draw tree.
To see a list of available states, use --show-states
flag.
--collection-autoload NAME
Automatically load a collection and draw tree. To see
a list of available collections, use --list-
collections flag.
--export-svg FILE_PATH
The SVG output file path.
SWEET PARAMS OF CONVENIENCE: Parameters and flags that are not quite essential (but nice to have).
--skip-init-functions
When declared, function calls for genes will not be
initialized (therefore will be missing from all
relevant interfaces or output files). The use of this
flag may reduce the memory footprint and processing
time for large datasets.
--dry-run Don't do anything real. Test everything, and stop
right before wherever the developer said 'well, this
is enough testing', and decided to print out results.
--skip-auto-ordering When declared, the attempt to include automatically
generated orders of items based on additional data is
skipped. In case those buggers cause issues with your
data, and you still want to see your stuff and deal
with the other issue maybe later.
SERVER CONFIGURATION: For power users.
-I IP_ADDR, --ip-address IP_ADDR
IP address for the HTTP server. The default ip address
(0.0.0.0) should work just fine for most.
-P INT, --port-number INT
Port number to use for anvi'o services. If nothing is
declared, anvi'o will try to find a suitable port
number, starting from the default port number, 8080.
--browser-path PATH By default, anvi'o will use your default browser to
launch the interactive interface. If you would like to
use something else than your system default, you can
provide a full path for an alternative browser using
this parameter, and hope for the best. For instance we
are using this parameter to call Google's experimental
browser, Canary, which performs better with demanding
visualizations.
--read-only When the interactive interface is started with this
flag, all 'database write' operations will be
disabled.
--server-only The default behavior is to start the local server, and
fire up a browser that connects to the server. If you
have other plans, and want to start the server without
calling the browser, this is the flag you need.
--password-protected If this flag is set, command line tool will ask you to
enter a password and interactive interface will be
only accessible after entering same password. This
option is recommended for shared machines like
clusters or shared networks where computers are not
isolated.
--user-server-shutdown
Allow users to shut down an anvi'server via the web
interface.
Interactively visualize sequence variants on protein structures
Usage
anvi-display-structure [-h] -s STRUCTURE_DB [-p PROFILE_DB]
[-c CONTIGS_DB] [-V VARIABILITY_TABLE]
[--splits-of-interest FILE] [-C COLLECTION_NAME]
[-b BIN_NAME] [--samples-of-interest FILE]
[--genes-of-interest FILE]
[--gene-caller-ids GENE_CALLER_IDS] [-j FLOAT]
[--SAAVs-only] [--SCVs-only] [-I IP_ADDR]
[-P INT] [--browser-path PATH] [--server-only]
[--password-protected]
Parameters
STRUCTURE: Information related to the structure database, which can be created with anvi-gen-structure-database.
-s STRUCTURE_DB, --structure-db STRUCTURE_DB
Anvi'o structure database.
VARIABILITY: We can overlay codon and amino acid variability in your metagenomes but we need a data source of this variability. Most simply, anvi'o can learn this information when you provide both your profile (-p) and contigs (-c) databases. Alternatively, you can provide a variability table output (-V) from the program anvi-gen-variability-profile. If you don't want to visualize variants, this is the wrong tool for the job. Instead, export the PDB files with anvi-export-structures, and open with a more comprehensive protein viewing software.
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-V VARIABILITY_TABLE, --variability-profile VARIABILITY_TABLE
The output of anvi-gen-variability-profile, or a
different variant-calling output that has been
converted to the anvi'o format.
REFINING PARAMETERS: Which samples, genes, and contigs etc. are you interested in? Define that stuff here.
--splits-of-interest FILE
A file with split names. There should be only one
column in the file, and each line should correspond to
a unique split name.
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-b BIN_NAME, --bin-id BIN_NAME
If provided, any genes found in both your bin and your
structure database will be available for display.
--samples-of-interest FILE
A file with sample names. There should be only one
column in the file, and each line should correspond to
a unique sample name (without a column header).
--genes-of-interest FILE
A file with anvi'o gene caller IDs. There should be
only one column in the file, and each line should
correspond to a unique gene caller id (without a
column header).
--gene-caller-ids GENE_CALLER_IDS
Gene caller ids. Multiple of them can be declared
separated by a delimiter (the default is a comma). In
anvi-gen-variability-profile, if you declare nothing
you will get all genes matching your other filtering
criteria. In other programs, you may get everything,
nothing, or an error. It really depends on the
situation. Fortunately, mistakes are cheap, so it's
worth a try.
-j FLOAT, --min-departure-from-consensus FLOAT
Takes a value between 0 and 1, where 1 is maximum
divergence from the consensus. It can be an expensive
operation to display every variable position, and so
the default is 0.05. To display every variable
position, set this parameter to 0.
--SAAVs-only If provided, variability will be generated for single
amino acid variants (SAAVs) and not for single codon
variants (SCVs). This could save you some time if
you're only interested in SAAVs.
--SCVs-only If provided, variability will be generated for single
codon variants (SCVs) and not for single amino acid
variants (SAAVs). This could save you some time if
you're only interested in SCVs.
SERVER CONFIGURATION: For power users.
-I IP_ADDR, --ip-address IP_ADDR
IP address for the HTTP server. The default ip address
(0.0.0.0) should work just fine for most.
-P INT, --port-number INT
Port number to use for anvi'o services. If nothing is
declared, anvi'o will try to find a suitable port
number, starting from the default port number, 8080.
--browser-path PATH By default, anvi'o will use your default browser to
launch the interactive interface. If you would like to
use something else than your system default, you can
provide a full path for an alternative browser using
this parameter, and hope for the best. For instance we
are using this parameter to call Google's experimental
browser, Canary, which performs better with demanding
visualizations.
--server-only The default behavior is to start the local server, and
fire up a browser that connects to the server. If you
have other plans, and want to start the server without
calling the browser, this is the flag you need.
--password-protected If this flag is set, command line tool will ask you to
enter a password and interactive interface will be
only accessible after entering same password. This
option is recommended for shared machines like
clusters or shared networks where computers are not
isolated.
Estimate completion and redundancy using domain-specific single-copy core genes.
Usage
anvi-estimate-genome-completeness [-h] [-c CONTIGS_DB] [-e FILE_PATH]
[-p PROFILE_DB] [-C COLLECTION_NAME]
[--list-collections] [--just-do-it]
[--concise] [-o FILE_PATH]
Parameters
MANDATORY INPUT OPTION #1: Minimum input is an anvi'o contigs database. If you provide nothing else, anvi'o will assume that it is a single genome (even if it is not), and give you back what you need.
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
MANDATORY INPUT OPTION #2: Or you can initiate this with an external genomes file.
-e FILE_PATH, --external-genomes FILE_PATH
A two-column TAB-delimited flat text file that lists
anvi'o contigs databases. The first item in the header
line should read 'name', and the second should read
'contigs_db_path'. Each line in the file should
describe a single entry, where the first column is the
name of the genome (or MAG), and the second column is
the anvi'o contigs database generated for this genome.
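If you need to build an external genomes file for many genomes, a short Python sketch like the one below can write it for you. Only the two header names come from the description above; the function name, the genome names, and the paths are hypothetical placeholders.

```python
import csv
import io


def write_external_genomes(genomes, handle):
    """Write (name, contigs_db_path) pairs as a two-column TAB-delimited file."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["name", "contigs_db_path"])
    for name, contigs_db_path in genomes:
        writer.writerow([name, contigs_db_path])


# Hypothetical genome names and paths, for illustration only.
genomes = [
    ("Genome_01", "/path/to/Genome_01.db"),
    ("Genome_02", "/path/to/Genome_02.db"),
]

buffer = io.StringIO()
write_external_genomes(genomes, buffer)
print(buffer.getvalue(), end="")
```

To write to disk instead of printing, pass a handle opened with `open("external-genomes.txt", "w", newline="")`, where the file name is again just an example.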
ADDITIONAL INPUT (OPTIONAL): You can also give this program an anvi'o profile database along with a collection name. In which case anvi'o will estimate the completion and redundancy of every bin in this collection. Fun.
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
PARAMETERS OF CONVENIENCE: Because life is already very hard as it is.
--list-collections Show available collections and exit.
--just-do-it Don't bother me with questions or warnings, just do
it.
--concise Don't be verbose, print fewer messages whenever
possible.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
Reconstructs metabolic pathways and estimates pathway completeness for a given set of contigs
Usage
anvi-estimate-metabolism [-h] [-c CONTIGS_DB] [-m]
[--kegg-data-dir KEGG_DATA_DIR]
[-p PROFILE_DB] [-C COLLECTION_NAME]
[-b BIN_NAME] [-B FILE_PATH] [-e FILE_PATH]
[-i FILE_PATH] [-M FILE_PATH]
[--module-completion-threshold NUM]
[-O FILENAME_PREFIX]
[--kegg-output-modes MODES]
[--list-available-modes]
[--custom-output-headers HEADERS]
[--list-available-output-headers]
[--matrix-format]
[--get-raw-data-as-json FILENAME_PREFIX]
[--store-json-without-estimation]
[--estimate-from-json FILE_PATH]
Parameters
INPUT #1: The minimum you must provide this program is a contigs database. In which
case anvi'o will attempt to estimate metabolism for all contigs in it,
assuming that the contigs database represents a single genome. If the
contigs database is actually a metagenome, you should use the
--metagenome-mode flag to explicitly declare that.
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-m, --metagenome-mode
Treat a given contigs database as a metagenome rather
than treating it as a single genome.
--kegg-data-dir KEGG_DATA_DIR
The directory path for your KEGG setup, which will
include things like KOfam profiles and KEGG MODULE
data. Anvi'o will try to use the default path if you
do not specify anything.
INPUT #2: If you also provide a profile database AND a collection name, anvi'o will
estimate metabolism separately for each bin in your collection. You can
also limit those estimates to a specific bin or set of bins in the
collection using the parameters --bin-id or --bin-ids-file, respectively.
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.
-B FILE_PATH, --bin-ids-file FILE_PATH
Text file for bins (each line should be a unique bin
id).
INPUT #3: If you have multiple contigs databases to work with, you can put them all into a file. Then anvi'o will run estimation separately on each database and generate a single output file for all. There are 3 types of input files to choose from depending on whether you have single genomes (external), genomes in collections (internal), or metagenomes in your contigs DBs.
-e FILE_PATH, --external-genomes FILE_PATH
A two-column TAB-delimited flat text file that lists
anvi'o contigs databases. The first item in the header
line should read 'name', and the second should read
'contigs_db_path'. Each line in the file should
describe a single entry, where the first column is the
name of the genome (or MAG), and the second column is
the anvi'o contigs database generated for this genome.
-i FILE_PATH, --internal-genomes FILE_PATH
A five-column TAB-delimited flat text file. The header
line must contain these columns: 'name', 'bin_id',
'collection_id', 'profile_db_path', 'contigs_db_path'.
Each line should list a single entry, where 'name' can
be any name to describe the anvi'o bin identified as
'bin_id' that is stored in a collection.
-M FILE_PATH, --metagenomes FILE_PATH
A two-column TAB-delimited flat text file. The header
line must contain these columns: 'name',
'contigs_db_path', and 'profile_db_path'. Each line
should list a single entry, where 'name' can be any
name to describe the metagenome stored in the anvi'o
contigs database. In this context, the anvi'o profiles
associated with contigs database must be SINGLE
PROFILES, as in generated by the program `anvi-
profile` and not `anvi-merge`.
OUTPUT: Parameters for controlling estimation output. The output will be TAB-delimited files which by default are prefixed with 'kegg-metabolism', but you can of course change that name here.
--module-completion-threshold NUM
This threshold defines the point at which we consider
a KEGG module to be 'complete' or 'present' in a given
genome or bin. It is the fraction of steps that must
be complete in order for the entire module to be
marked complete. The default is 0.75.
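The threshold logic can be pictured with the minimal Python sketch below. It treats a module as a flat list of steps that are either complete or not, and assumes that a fraction equal to the threshold counts as complete; both are simplifying assumptions for the example, and the function name is hypothetical.

```python
def module_is_complete(step_states, threshold=0.75):
    """Return True if the fraction of complete steps reaches the threshold.

    step_states: list of booleans, one per step in the module.
    """
    if not step_states:
        return False
    fraction_complete = sum(step_states) / len(step_states)
    return fraction_complete >= threshold


print(module_is_complete([True, True, True, False]))   # True  (3 of 4 steps)
print(module_is_complete([True, True, False, False]))  # False (2 of 4 steps)
```

Lowering the threshold marks more partially covered modules as complete, which may be useful for incomplete genomes or bins.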
-O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
A prefix to be used while naming the output files (no
file type extensions please; just a prefix).
--kegg-output-modes MODES
Use this flag to indicate what information you want in
the kegg metabolism output files, by providing a
comma-separated list of output modes (each 'mode' you
provide will result in a different output file, all
with the same prefix). The default output modes are
'kofam_hits' and 'complete_modules'. To see a list of
available output modes, run this script with the flag
--list-available-modes.
--list-available-modes
Use this flag to see the available output modes and
their descriptions.
--custom-output-headers HEADERS
For use with the 'custom' output mode. Provide a
comma-separated list of headers to include in the
output matrix. To see a list of available headers, run
this script with the flag --list-available-output-
headers.
--list-available-output-headers
Use this flag to see the available output headers.
--matrix-format If you want to generate the output in several sparse
matrices instead of one file, use this flag. In each
matrix, contigs DBs will be arranged in columns and
KEGG modules in rows. This output option is especially
appropriate for input option #3.
DEBUG: Parameters to use if you think something fishy is going on or otherwise want to exert more control. Go for it.
--get-raw-data-as-json FILENAME_PREFIX
If you want the raw metabolism estimation data
dictionary in JSON-format, provide a filename prefix
to this argument. The program will then output a file
with the .json extension containing this data.
--store-json-without-estimation
This flag is used to control what is stored in the
JSON-formatted metabolism data dictionary. When this
flag is provided alongside the --get-raw-data-as-json
flag, the JSON file will be created without running
metabolism estimation, and that file will consequently
include only information about KOfam hits and gene
calls. The idea is that you can then modify this file
as you like and re-run this program using the flag
--estimate-from-json.
--estimate-from-json FILE_PATH
If you have a JSON file containing KOfam hits and gene
call information from your contigs database (such as a
file produced using the --get-raw-data-as-json flag),
you can provide that file to this flag and KEGG
metabolism estimates will be computed from the
information within instead of from a contigs database.
Estimates taxonomy at genome and metagenome level. This program is the entry point to estimate taxonomy for a given set of contigs (i.e., all contigs in a contigs database, or contigs described in collections as bins). For this, it uses single-copy core gene sequences and the GTDB database.
Usage
anvi-estimate-scg-taxonomy [-h] [-c CONTIGS_DB] [-m] [-p PROFILE_DB]
[-C COLLECTION_NAME] [-M FILE_PATH]
[-o FILE_PATH] [-O FILENAME_PREFIX]
[--taxonomic-level {t_domain,t_phylum,t_class,t_order,t_family,t_genus,t_species}]
[--matrix-format] [--raw-output]
[-T NUM_THREADS] [-S SCG_NAME]
[--report-scg-frequencies FILE_PATH]
[--just-do-it]
[--simplify-taxonomy-information]
[--compute-scg-coverages]
[--update-profile-db-with-taxonomy]
[-r PATH]
Parameters
INPUT #1: The minimum you must provide this program is a contigs database. In which
case anvi'o will attempt to estimate taxonomy for all contigs in it,
assuming that the contigs database represents a single genome. If the
contigs database is actually a metagenome, you should use the
--metagenome-mode flag to explicitly declare that.
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-m, --metagenome-mode
Treat a given contigs database as a metagenome rather
than treating it as a single genome.
INPUT #2: In addition, you can also point out a profile database. In which case you also must provide a collection name. When you do that anvi'o will offer taxonomy estimates for each bin in your collection.
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
INPUT #3: You can also work with a metagenomes file, assuming that you have multiple metagenomes with or without associated mapping results, and anvi'o would generate a single output file for all.
-M FILE_PATH, --metagenomes FILE_PATH
A two-column TAB-delimited flat text file. The header
line must contain these columns: 'name',
'contigs_db_path', and 'profile_db_path'. Each line
should list a single entry, where 'name' can be any
name to describe the metagenome stored in the anvi'o
contigs database. In this context, the anvi'o profiles
associated with contigs database must be SINGLE
PROFILES, as in generated by the program `anvi-
profile` and not `anvi-merge`.
OUTPUT AND FORMATTING: Anvi'o will do its best to offer you some fancy output tables for your viewing pleasure by default. But in addition to that, you can ask the resulting information to be stored in a TAB-delimited file (which is a much better way to include the results in your study as supplementary information, or work with these results using other analysis tools such as R). Depending on the mode you are running this program, anvi'o may ask you to use an 'output file prefix' rather than an 'output file path'.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
-O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
A prefix to be used while naming the output files (no
file type extensions please; just a prefix).
--taxonomic-level {t_domain,t_phylum,t_class,t_order,t_family,t_genus,t_species}
The taxonomic level to use whenever relevant and/or
available. The default taxonomic level is None, but if
you choose something specific, anvi'o will focus on
that whenever possible.
--matrix-format If you want the reports to look like sparse matrices
whenever possible, declare this flag. Matrices are
especially good to use when you are working with
internal/external genomes since they can show you
quickly the distribution of each taxon across all
metagenomes in programs like EXCEL. WELL TRY IT AND
SEE.
--raw-output Just store the raw output without any processing of
the primary data structure.
PERFORMANCE: We are not sure if allocating more threads for this operation will change anything. But hey. One can try.
-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on a SGE
--if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.
AUTHORITY: Assert your dominance.
-S SCG_NAME, --scg-name-for-metagenome-mode SCG_NAME
When running in metagenome mode, anvi'o automatically
chooses the most frequent single-copy core gene to
estimate the taxonomic composition within a contigs
database. If you have a different preference you can
use this parameter to communicate that.
--report-scg-frequencies FILE_PATH
Report SCG frequencies in a TAB-delimited file and
quit. This is a great way to decide which SCG name to
use in metagenome mode (we often wish to use the most
frequent SCG to increase the detection of taxa).
--just-do-it Don't bother me with questions or warnings, just do
it.
ADVANCED: Very pro-like stuff.
--simplify-taxonomy-information
The taxonomy output may include a large number of
names that contain clade-specific code for not-yet-
characterized taxa. With this flag you can simplify
taxon names. This will influence all output files and
displays as the use of this flag will on-the-fly trim
taxonomic levels with clade-specific code names.
--compute-scg-coverages
When this flag is declared, anvi'o will go back to the
profile database to learn coverage statistics of
single-copy core genes for which we have taxonomy
information.
--update-profile-db-with-taxonomy
When anvi'o knows both taxonomic affiliations and
coverages across samples for single-copy core genes,
it can, in theory, add this information to the profile
database. With this flag you can instruct anvi'o to do
that and find information on taxonomy in the `layers`
tab of your interactive interface.
BORING: Options that you will likely never need.
-r PATH, --taxonomy-database PATH
Path to the directory that contains the BLAST
databases for single-copy core genes. You will almost
never need to use this parameter unless you are trying
something very fancy. But when you do, you can tell
anvi'o where to look for database files through this
parameter.
why yes we do stuff here.
Usage
anvi-experimental-organization [-h] [-p PROFILE_DB] -c CONTIGS_DB
[-i DIR_PATH] [-N NAME]
[--distance DISTANCE_METRIC]
[--linkage LINKAGE_METHOD]
[--skip-store-in-db] [-o FILE_PATH]
[--dry-run]
FILE
Parameters
positional arguments:
FILE Config file for clustering of contigs. See
documentation for help.
optional arguments:
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-i DIR_PATH, --input-directory DIR_PATH
Input directory where the input files addressed from
the configuration file can be found (i.e., the profile
database, if PROFILE.db::TABLE notation is used in the
configuration file).
-N NAME, --name NAME The name to use when storing the resulting clustering
in the database. This name will appear in the
interactive interface and other relevant interfaces.
Please consider using a short and descriptive single-
word (if you do not do that you will make anvi'o
complain).
--distance DISTANCE_METRIC
The distance metric for the hierarchical clustering.
If you do not use this flag, the distance metric you
defined in your clustering config file will be used.
If you have not defined one in your config file, then
the system default will be used, which is "euclidean".
--linkage LINKAGE_METHOD
Same story with the `--distance`, except, the system
default for this one is ward.
--skip-store-in-db By default, analysis results are stored in the profile
database. The use of this flag will let you skip that.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
--dry-run Don't do anything real. Test everything, and stop
right before wherever the developer said 'well, this
is enough testing', and decided to print out results.
Export a collection from an anvi'o database
Usage
anvi-export-collection [-h] -p PAN_OR_PROFILE_DB [-C COLLECTION_NAME]
[-O FILENAME_PREFIX] [--list-collections]
[--include-unbinned]
Parameters
optional arguments:
-p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database (and even genes
database in appropriate contexts).
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
A prefix to be used while naming the output files (no
file type extensions please; just a prefix).
--list-collections Show available collections and exit.
--include-unbinned When this flag is used, anvi'o will also store in the
output file the items that do not appear in any of
your bins. This new bin will be called
'UNBINNED_ITEMS_BIN'. Yes. The ugly name is
intentional.
Export contigs (or splits) from an anvi'o contigs database
Usage
anvi-export-contigs [-h] -c CONTIGS_DB [--contigs-of-interest FILE]
[--splits-mode] -o FILE_PATH [--just-do-it]
Parameters
optional arguments:
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
--contigs-of-interest FILE
It is possible to focus on only a set of contigs. If
you would like to do that and ignore the rest of the
contigs in your contigs database, use this parameter
with a flat file every line of which describes a single
contig name.
--splits-mode Export split sequences instead.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
--just-do-it Don't bother me with questions or warnings, just do
it.
Export functions of genes from an anvi'o contigs database for a given annotation source
Usage
anvi-export-functions [-h] -c CONTIGS_DB [-o FILE_PATH]
[--annotation-sources SOURCE NAME[S]] [-l]
Parameters
optional arguments:
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
--annotation-sources SOURCE NAME[S]
Get functional annotations for a specific list of
annotation sources. You can specify one or more
sources by separating them from each other with a
comma character (i.e., '--annotation-sources
source_1,source_2,source_3'). The default behavior is
to return everything
-l, --list-annotation-sources
List available functional annotation sources.
Export gene calls from an anvi'o contigs database.
Usage
anvi-export-gene-calls [-h] -c CONTIGS_DB [-o FILE_PATH]
[--gene-caller GENE-CALLER]
[--list-gene-callers]
[--skip-sequence-reporting]
Parameters
optional arguments:
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
--gene-caller GENE-CALLER
Which gene caller(s) would you like to export gene
calls for? If providing multiple they should be comma-
separated (no spaces). If you don't know, use --list-
gene-callers
--list-gene-callers List available gene callers in the contigs database
and quit.
--skip-sequence-reporting
By default, exported gene calls have an amino acid
sequences column in the output. Turn this behavior off
with this flag
Export gene coverage and detection data for all genes associated with contigs described in a profile database.
Usage
anvi-export-gene-coverage-and-detection [-h] -p PROFILE_DB -c
CONTIGS_DB -O FILENAME_PREFIX
Parameters
optional arguments:
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
A prefix to be used while naming the output files (no
file type extensions please; just a prefix).
Export an item order from an anvi'o database
Usage
anvi-export-items-order [-h] [-p DB PATH] [--name ORDER NAME]
[-o FILE_PATH]
Parameters
INPUT: The database and the items order of interest
-p DB PATH, --db-path DB PATH
An appropriate anvi'o database.
--name ORDER NAME The name of the order you want to export. If you don't
provide an order name, anvi'o will show you the names
of all available orders in the database.
OUTPUT: Output file name and stuff
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
This program helps you cut a 'locus' from a larger genetic context (e.g., contigs, genomes). By default, anvi'o will locate a user-defined anchor gene, extend its selection upstream and downstream based on the --num-genes argument, then extract the locus to create a new contigs database. The anchor gene must be provided as --search-term, --gene-caller-ids, or --hmm-sources. If --flank-mode is designated, you MUST provide TWO flanking genes that define the locus region (please see the --flank-mode help for more information). If everything goes as planned, anvi'o will give you individual locus contigs databases for every matching anchor gene found in the original contigs database provided. Enjoy your mini contigs databases!
Usage
anvi-export-locus [-h] -c CONTIGS_DB [-s SEARCH_TERM]
[--gene-caller-ids GENE_CALLER_IDS]
[--delimiter CHAR] [-o DIR_PATH] -O FILENAME_PREFIX
[--flank-mode] [-n NUM_GENES] [--use-hmm]
[--hmm-sources SOURCE NAME] [-l]
[--annotation-sources SOURCE NAME[S]] [-W]
[--remove-partial-hits] [--never-reverse-complement]
Parameters
Essential INPUT:
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
Query options for locating locus: search according to either hmm or functional annotations
-s SEARCH_TERM, --search-term SEARCH_TERM
search term.
--gene-caller-ids GENE_CALLER_IDS
Gene caller ids. Multiple of them can be declared
separated by a delimiter (the default is a comma). In
anvi-gen-variability-profile, if you declare nothing
you will get all genes matching your other filtering
criteria. In other programs, you may get everything,
nothing, or an error. It really depends on the
situation. Fortunately, mistakes are cheap, so it's
worth a try.
--delimiter CHAR The delimiter to parse multiple input terms. The
default is ','.
THE OUTPUT: Where the output should go. It will be one FASTA file with all matches or one FASTA per match (see --separate-fasta)
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files
-O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
A prefix to be used while naming the output files (no
file type extensions please; just a prefix).
ADDITIONAL STUFF: Flags and parameters you can set according to your need
--flank-mode If in --flank-mode, anvi-export-locus will extract a
locus based on the coordinates of flanking genes. You
MUST provide 2 flanking genes in the form of TWO
--search-term, --gene-caller-ids, or --hmm-sources.
The --flank-mode option is appropriate for extracting
loci that vary in their number of genes but are
consistently located between the same flanking genes
in the genome(s) of interest.
-n NUM_GENES, --num-genes NUM_GENES
Required for DEFAULT mode. For each match (to the
function, or HMM that was searched) a sequence which
includes a block of genes will be saved. The block
could include either genes only in the forward
direction of the gene (defined according to the
direction of transcription of the gene) or reverse or
both. If you wish to get both directions, use a comma
(no spaces) to define the block. For example, '-n 4,5'
will give you four genes before and five genes after.
Whereas, '-n 5' will give you five genes after (in
addition to the gene that matched). To get only genes
preceding the match use '-n 5,0'. If the number of
genes requested exceeds the length of the contig, then
the output will include the sequence until the end of
the contig.
--use-hmm Use HMM hits instead of functional annotations. In
other words, --search-term will be queried against HMM
source annotations, NOT functional annotations. If you
choose this option, you must also say which HMM source
to use.
--hmm-sources SOURCE NAME
Get sequences for a specific list of HMM sources. You
can list one or more sources by separating them from
each other with a comma character (i.e., '--hmm-
sources source_1,source_2,source_3'). If you would
like to see a list of available sources in the contigs
database, run this program with '--list-hmm-sources'
flag.
-l, --list-hmm-sources
List available HMM sources in the contigs database and
quit.
--annotation-sources SOURCE NAME[S]
Get functional annotations for a specific list of
annotation sources. You can specify one or more
sources by separating them from each other with a
comma character (i.e., '--annotation-sources
source_1,source_2,source_3'). The default behavior is
to return everything
-W, --overwrite-output-destinations
Overwrite if the output files and/or directories
exist.
--remove-partial-hits
By default anvi'o will return hits even if they are
partial. Declaring this flag will make anvi'o filter
all hits that are partial. Partial hits are hits in
which you asked for n1 genes before and n2 genes after
the gene that matched the search criteria but the
search hits the end of the contig before finding the
number of genes that you asked.
--never-reverse-complement
By default, if a gene that is found by the search
criteria is reverse in its direction, then the
sequence of the entire locus is reversed before it is
saved to the output. If you wish to prevent this
behavior then use the flag --never-reverse-complement.
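The `-n` notation and the idea of a 'partial' hit described above can be sketched in a few lines of Python. This is only an illustration of the rules stated in the help text, not anvi'o's own code: the function names `parse_num_genes` and `locus_gene_range` are hypothetical, and the sketch treats genes as simple positions on a contig without accounting for the direction of transcription.

```python
from typing import Tuple


def parse_num_genes(spec: str) -> Tuple[int, int]:
    """Turn a '-n' value such as '4,5' or '5' into (genes_before, genes_after)."""
    parts = spec.split(",")
    if len(parts) == 1:
        # A single number means "this many genes after the match".
        before, after = 0, int(parts[0])
    elif len(parts) == 2:
        before, after = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"Expected 'N' or 'N,M', got {spec!r}")
    if before < 0 or after < 0:
        raise ValueError("Gene counts cannot be negative")
    return before, after


def locus_gene_range(anchor_index: int, genes_on_contig: int,
                     before: int, after: int) -> Tuple[int, int, bool]:
    """Return (first, last, is_partial) gene indices, clipped to the contig.

    Hypothetical helper: indices are 0-based positions of genes on one contig.
    """
    first = max(0, anchor_index - before)
    last = min(genes_on_contig - 1, anchor_index + after)
    # The hit is partial if the contig ended before all requested genes were found.
    is_partial = (anchor_index - first) < before or (last - anchor_index) < after
    return first, last, is_partial
```

For instance, `parse_num_genes("5,0")` yields five genes before the match and none after, and a match on the second gene of a contig with `-n 4,5` would be flagged as partial because only one preceding gene exists.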
Export additional data or order tables in pan or profile databases for items or layers.
Usage
anvi-export-misc-data [-h] [-p PAN_OR_PROFILE_DB] [-c CONTIGS_DB] -t
NAME [-D NAME] [-o FILE_PATH]
Parameters
Database input: Provide 1 of these
-p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database (and even genes
database in appropriate contexts).
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
Details: Everything else.
-t NAME, --target-data-table NAME
The target table is the table you are interested in
accessing. Currently it can be 'items', 'layers', or
'layer_orders'. Please see most up-to-date online
documentation for more information.
-D NAME, --target-data-group NAME
Data group to focus. Anvi'o misc data tables support
associating a set of data keys with a data group. If
you have no idea what this is, then probably you don't
need it, and anvi'o will take care of you. Note: this
flag is IRRELEVANT if you are working with additional
order data tables.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
Export split or contig sequences and coverages across samples stored in an
anvi'o profile database. This program is especially useful if you would like
to 'bin' your splits or contigs outside of anvi'o and import the binning
results into anvi'o using anvi-import-collection
program.
Usage
anvi-export-splits-and-coverages [-h] -p PROFILE_DB -c CONTIGS_DB
[-o DIR_PATH] [-O FILENAME_PREFIX]
[--splits-mode] [--report-contigs]
[--use-Q2Q3-coverages]
Parameters
optional arguments:
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files
-O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
A prefix to be used while naming the output files (no
file type extensions please; just a prefix).
--splits-mode Specify this flag if you would like to output
coverages of individual 'splits', rather than their
'parent' contig coverages.
--report-contigs By default this program reports sequences and their
coverages for 'splits'. By using this flag, you can
report contig sequences and coverages instead. For
obvious reasons, you can't use this flag with
`--splits-mode` flag.
--use-Q2Q3-coverages By default this program reports the mean coverage of a
split (or contig, see --report-contigs) for each
sample. By using this flag, you can report the mean
Q2Q3 coverage by excluding 25 percent of the
nucleotide positions with the smallest coverage
values, and 25 percent of the nucleotide positions
with the largest coverage values. The hope is that
this removes 'outlier' positions resulting from non-
specific mapping, etc. that skew the mean coverage
estimate.
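To make the Q2Q3 idea concrete, here is a minimal Python sketch of a mean computed after discarding the lowest and highest quarter of per-nucleotide coverage values. It is an illustration under stated assumptions rather than anvi'o's implementation: the function name `mean_q2q3_coverage` is hypothetical, and rounding the number of trimmed positions down is an assumption, since the help text does not say how uneven lengths are handled.

```python
from typing import Sequence


def mean_q2q3_coverage(coverages: Sequence[float]) -> float:
    """Mean coverage after dropping the lowest and highest 25% of positions."""
    if not coverages:
        raise ValueError("At least one coverage value is required")
    values = sorted(coverages)
    # Assumption: the number of positions trimmed from each end is rounded down.
    trim = len(values) // 4
    middle = values[trim:len(values) - trim]
    return sum(middle) / len(middle)


# One position with almost no coverage and one with non-specific mapping.
example = [1, 10, 10, 10, 10, 10, 10, 500]
plain_mean = sum(example) / len(example)
q2q3_mean = mean_q2q3_coverage(example)
```

In this toy example the two outlier positions pull the plain mean far above the coverage of a typical position, while the Q2Q3 mean ignores them.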
Export taxonomy for splits found in an anvi'o contigs database
Usage
anvi-export-splits-taxonomy [-h] -c CONTIGS_DB -o FILE_PATH
Parameters
optional arguments:
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
Export an anvi'o state from a pan or profile database.
Usage
anvi-export-state [-h] -p PAN_OR_PROFILE_DB [-o FILE_PATH]
[-s STATE_NAME] [--list-states]
Parameters
optional arguments:
-p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database (and even genes
database in appropriate contexts).
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
-s STATE_NAME, --state STATE_NAME
The state name to export.
--list-states Show available states and exit.
Export .pdb structure files from a structure database.
Usage
anvi-export-structures [-h] -s STRUCTURE_DB [-o DIR_PATH]
[--gene-caller-ids GENE_CALLER_IDS]
[--genes-of-interest FILE]
Parameters
optional arguments:
-s STRUCTURE_DB, --structure-db STRUCTURE_DB
Anvi'o structure database.
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files
--gene-caller-ids GENE_CALLER_IDS
Gene caller ids. Multiple of them can be declared
separated by a delimiter (the default is a comma). In
anvi-gen-variability-profile, if you declare nothing
you will get all genes matching your other filtering
criteria. In other programs, you may get everything,
nothing, or an error. It really depends on the
situation. Fortunately, mistakes are cheap, so it's
worth a try.
--genes-of-interest FILE
A file with anvi'o gene caller IDs. There should be
only one column in the file, and each line should
correspond to a unique gene caller id (without a
column header).
Export anvi'o database tables as TAB-delimited text files.
Usage
anvi-export-table [-h] [--table TABLE_NAME] [-l] [-f FIELDS]
[-o FILE_PATH]
DB
Parameters
positional arguments:
DB Anvi'o database to read from.
optional arguments:
--table TABLE_NAME Table name to export.
-l, --list Gives a list of tables in a database and quits. If a
table is already declared this time it lists all the
fields in a given table, in case you would like to export
only a specific list of fields from the table using
--fields parameter.
-f FIELD(S), --fields FIELD(S)
Fields to report. Use the --list parameter with a
table name to see available fields. You can list fields
using this notation: --fields 'field_1, field_2, ...
field_N'.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
Generate a new anvi'o contigs database.
Usage
anvi-gen-contigs-database [-h] -f FASTA [-n PROJECT_NAME]
[-o DB_FILE_PATH] [--description TEXT_FILE]
[-L INT] [-K INT] [--skip-gene-calling]
[--prodigal-translation-table INT]
[--external-gene-calls GENE-CALLS]
[--ignore-internal-stop-codons]
[--skip-predict-frame]
[--skip-mindful-splitting]
Parameters
MANDATORY INPUTS: Things you really need to provide to be in business.
-f FASTA, --contigs-fasta FASTA
The FASTA file that contains reference sequences you
mapped your samples against. This could be a reference
genome, or contigs from your assembler. Contig names
in this file must match to those in other input files.
If there is a problem anvi'o will gracefully complain
about it.
-n PROJECT_NAME, --project-name PROJECT_NAME
Name of the project. Please choose a short but
descriptive name (so anvi'o can use it whenever she
needs to name an output file, or add a new table in a
database, or name her first born).
OPTIONAL INPUTS: Things you may want to tweak.
-o DB_FILE_PATH, --output-db-path DB_FILE_PATH
Output file path for the new database.
--description TEXT_FILE
A plain text file that contains some description about
the project. You can use Markdown syntax. The
description text will be rendered and shown in all
relevant interfaces, including the anvi'o interactive
interface, or anvi'o summary outputs.
-L INT, --split-length INT
Anvi'o splits very long contigs into smaller pieces,
without actually splitting them for real. These
'virtual' splits improve the efficacy of the
visualization step, and changing the split size gives
freedom to the user to adjust the resolution of their
display when necessary. The default value is (20000).
If you are planning to use your contigs database for
metagenomic binning, we advise you to not go below
10,000 (since the lower the split size is, the more
items to show in the display, and decreasing the split
size does not really help much to binning). But if you
are thinking about using this parameter for ad hoc
investigations other than binning, you should ignore
our advice, and set the split size as low as you want.
If you do not want your contigs to be split, you can
set the split size to '0' or any other negative
integer (lots of unnecessary freedom here, enjoy!).
-K INT, --kmer-size INT
K-mer size for k-mer frequency calculations. The
default k-mer size for composition-based analyses is
4, historically. Although tetra-nucleotide frequencies
seem to offer the sweet spot of sensitivity,
information density, and manageable number of
dimensions for clustering approaches, you are welcome
to experiment (but maybe you should leave it as is for
your first set of analyses).
--skip-mindful-splitting
By default, anvi'o attempts to prevent soft-splitting
large contigs by cutting proper gene calls to make
sure a single gene is not broken into multiple splits.
This requires a careful examination of where genes
start and end, and to find best locations to split
contigs with respect to this information. So, when the
user asks for a split size of, say, 1,000, it serves
as a mere suggestion. When this flag is used, anvi'o
does what the user wants and creates splits at desired
lengths (although some functionality may become
unavailable for the projects that rely on a contigs
database that is initiated this way).
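As an illustration of what the `--kmer-size` parameter above refers to, the following Python sketch counts overlapping k-mers in a single sequence. It is a simplified stand-in, not the routine anvi'o uses: the function name `kmer_frequencies` is hypothetical, and skipping k-mers that contain characters other than A, C, G, or T is an assumption made for the example.

```python
from itertools import product
from typing import Dict


def kmer_frequencies(sequence: str, k: int = 4) -> Dict[str, int]:
    """Count overlapping k-mers; every possible k-mer gets a key, even if absent."""
    sequence = sequence.upper()
    # With k = 4 this creates 4 ** 4 = 256 keys, one per tetranucleotide.
    counts = {"".join(letters): 0 for letters in product("ACGT", repeat=k)}
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Assumption: k-mers with ambiguous bases such as N are ignored.
        if kmer in counts:
            counts[kmer] += 1
    return counts


freqs = kmer_frequencies("ACGTACGT")
```

With k set to 4, each contig is described by 256 numbers, which is the "manageable number of dimensions" the help text alludes to; larger values of k grow this vector fourfold per step.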
GENES IN CONTIGS: Expert thingies.
--skip-gene-calling By default, generating an anvi'o contigs database
includes the identification of open reading frames in
contigs by running a bacterial gene caller. Declaring
this flag will by-pass that process. If you prefer,
you can later import your own gene calling results
into the database.
--prodigal-translation-table INT
This is a parameter to pass to Prodigal for a
specific translation table. This parameter corresponds
to the parameter `-g` in Prodigal, the default value
of which is 11 (so if you do not set anything, it will
be set to 11 in Prodigal runtime). Please refer to the
Prodigal documentation to determine what is the right
translation table for you if you think you need it.
--external-gene-calls GENE-CALLS
A TAB-delimited file to define external gene calls.
The file must have these columns: 'gene_callers_id' (a
unique integer number for each gene call, start from
1), 'contig' (the contig name the gene call is found),
'start' (start position, integer), 'stop' (stop
position, integer), 'direction' (the direction of the
gene open reading frame; can be 'f' or 'r'), 'partial'
(whether it is a complete gene call, or a partial one;
must be 1 for partial calls, and 0 for complete
calls), 'call_type' (1 if it is coding, 2 if it is
noncoding, or 3 if it is unknown (only gene calls with
call_type = 1 will have amino acid sequences
translated)), 'source' (the gene caller), and
'version' (the version of the gene caller, i.e.,
v2.6.7 or v1.0). An additional 'optional' column is
'aa_sequence' to explicitly define the amino acid
sequence of a gene call so anvi'o does not attempt to
translate the DNA sequence itself. An EXAMPLE FILE
(with the optional 'aa_sequence' column (so feel free
to take it out for your own case)) can be found at the
URL https://bit.ly/2qEEHuQ. If you are providing
external gene calls, please also see the flag `--skip-
predict-frame`.
--ignore-internal-stop-codons
This is only relevant when you have an external gene
calls file. If anvi'o figures out that your custom
gene calls result in amino acid sequences with stop
codons in the middle, it will complain about it. You
can use this flag to tell anvi'o not to check for
internal stop codons. Even though this shouldn't
happen in theory, we understand that it almost always
does. In these cases, anvi'o understands that
sometimes we don't want to care, and will not judge
you. Instead, it will replace every stop codon residue
in the amino acid sequence with an 'X' character.
Please let us know if you used this and things failed,
so we can tell you that you shouldn't have really used
it if you didn't like failures in the first place
(smiley).
--skip-predict-frame When you provide an external gene calls file,
anvi'o will predict the correct frame for gene calls
as best as it can by using a previously-generated
Markov model that is trained using the uniprot50
database (see this for details:
https://github.com/merenlab/anvio/pull/1428), UNLESS
there is an `aa_sequence` entry for a given gene call
in the external gene calls file. Please note that
PREDICTING FRAMES MAY CHANGE START/STOP POSITIONS OF
YOUR GENE CALLS SLIGHTLY, if those that are in the
external gene calls file are not describing proper
gene calls according to the model. If you use this
flag, anvi'o will not rely on any model and will
attempt to translate your DNA sequences by solely
relying upon start/stop positions in the file, but it
will complain about sequences start/stop positions of
which are not divisible by 3.
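The column requirements listed for `--external-gene-calls` can be checked before handing a file to anvi'o. The Python sketch below writes a TAB-delimited file with the required columns and validates the values the help text constrains. It is an illustration only: the helper names, the contig names, and the example rows are made up, and the checks cover only what is stated above (anvi'o itself may enforce more).

```python
import csv
import io
from typing import Dict, Iterable, TextIO

# Required columns, in the order the help text lists them.
REQUIRED_COLUMNS = ["gene_callers_id", "contig", "start", "stop", "direction",
                    "partial", "call_type", "source", "version"]


def validate_gene_call(row: Dict[str, object]) -> None:
    """Raise ValueError if a row breaks one of the documented constraints."""
    missing = [column for column in REQUIRED_COLUMNS if column not in row]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    if row["direction"] not in ("f", "r"):
        raise ValueError("'direction' must be 'f' or 'r'")
    if int(row["partial"]) not in (0, 1):
        raise ValueError("'partial' must be 0 (complete) or 1 (partial)")
    if int(row["call_type"]) not in (1, 2, 3):
        raise ValueError("'call_type' must be 1, 2, or 3")


def write_gene_calls(rows: Iterable[Dict[str, object]], handle: TextIO) -> None:
    """Write validated rows as a TAB-delimited table with a header line."""
    writer = csv.DictWriter(handle, fieldnames=REQUIRED_COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        validate_gene_call(row)
        writer.writerow(row)


# Hypothetical gene calls for illustration; replace with your own data.
example_rows = [
    {"gene_callers_id": 1, "contig": "contig_1", "start": 0, "stop": 300,
     "direction": "f", "partial": 0, "call_type": 1,
     "source": "my_gene_caller", "version": "v1.0"},
    {"gene_callers_id": 2, "contig": "contig_1", "start": 450, "stop": 900,
     "direction": "r", "partial": 1, "call_type": 1,
     "source": "my_gene_caller", "version": "v1.0"},
]

buffer = io.StringIO()
write_gene_calls(example_rows, buffer)
```

Writing to an in-memory buffer keeps the example side-effect free; passing an open file handle instead would produce a file suitable for inspection before use.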
Generate a pairwise matrix of fixation indices between samples
Example uses and other resources
Usage
anvi-gen-fixation-index-matrix [-h] [-p PROFILE_DB] [-c CONTIGS_DB]
[-s STRUCTURE_DB] [-V VARIABILITY_TABLE]
[-C COLLECTION_NAME] [-b BIN_NAME]
[--splits-of-interest FILE]
[--genes-of-interest FILE]
[--gene-caller-ids GENE_CALLER_IDS]
[--only-if-structure]
[--samples-of-interest FILE]
[--engine ENGINE]
[--min-coverage-in-each-sample INT]
[-o FIXATION_INDICES] [--keep-negatives]
Parameters
DATABASES: Declaring relevant anvi'o databases. First things first. Some are mandatory, some are optional.
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-s STRUCTURE_DB, --structure-db STRUCTURE_DB
Anvi'o structure database.
-V VARIABILITY_TABLE, --variability-profile VARIABILITY_TABLE
The output of anvi-gen-variability-profile, or a
different variant-calling output that has been
converted to the anvi'o format.
FOCUS :: BIN: You need to pick something to focus on. You can ask anvi'o to work with a bin in a collection.
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.
FOCUS :: SPLIT NAMES: Alternatively you can declare split names to focus.
--splits-of-interest FILE
A file with split names. There should be only one
column in the file, and each line should correspond to
a unique split name.
FOCUS :: GENE CALLER IDs: Alternatively you can declare gene caller IDs to focus.
--genes-of-interest FILE
A file with anvi'o gene caller IDs. There should be
only one column in the file, and each line should
correspond to a unique gene caller id (without a
column header).
--gene-caller-ids GENE_CALLER_IDS
Gene caller ids. Multiple of them can be declared
separated by a delimiter (the default is a comma). In
anvi-gen-variability-profile, if you declare nothing
you will get all genes matching your other filtering
criteria. In other programs, you may get everything,
nothing, or an error. It really depends on the
situation. Fortunately, mistakes are cheap, so it's
worth a try.
--only-if-structure If provided, your genes of interest will be further
subset to only include genes with structures in your
structure database, and therefore must be supplied in
conjunction with a structure database, i.e. `-s
<your_structure_database>`. If you did not specify
genes of interest, ALL genes will be subset to those
that have structures.
SAMPLES: You can ask anvi'o to focus only on a subset of samples.
--samples-of-interest FILE
A file with sample names. There should be only one
column in the file, and each line should correspond to
a unique sample name (without a column header).
ENGINE: Set your engine. This is important as it will define the output profile you will get from this program. The engine can focus on nucleotides (NT), codons (CDN), or amino acids (AA).
--engine ENGINE Variability engine. The default is 'NT'.
FILTERS: Parameters that will help you to do a very precise analysis. If you declare nothing from this bunch, you will get "everything" to play with, which is not necessarily a good thing…
--min-coverage-in-each-sample INT
Minimum coverage of a given variable nucleotide
position in all samples. If a nucleotide position is
covered less than this value even in one sample, it
will be removed from the analysis. Default is 0.
OUTPUT: Output file and style
-o FIXATION_INDICES, --output-file FIXATION_INDICES
File path to store results.
EXTRAS: Because why not be extra?
--keep-negatives Negative numbers are theoretically possible, and are
sometimes interpreted as out-breeding. By default, we
set negative numbers to 0 so the results are
reflective of a standard distance metric. Provide this
flag if you would prefer otherwise.
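The default treatment of negative values described for `--keep-negatives` amounts to clamping at zero. A minimal Python sketch of that behavior follows; the function name `clamp_negatives` and the example matrix are hypothetical, and the sketch says nothing about how anvi'o computes the indices themselves.

```python
from typing import List


def clamp_negatives(matrix: List[List[float]],
                    keep_negatives: bool = False) -> List[List[float]]:
    """Return a copy of the matrix with negative entries set to 0 unless kept."""
    if keep_negatives:
        return [list(row) for row in matrix]
    return [[max(0.0, value) for value in row] for row in matrix]


# Hypothetical pairwise values between three samples.
raw = [[0.0, -0.02, 0.31],
       [-0.02, 0.0, 0.12],
       [0.31, 0.12, 0.0]]
clamped = clamp_negatives(raw)
```

After clamping, every entry is non-negative, which is what allows the matrix to be read as a distance-like measure between samples.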
Collapse variability for a set of genes across samples
Usage
anvi-gen-gene-consensus-sequences [-h] -p PROFILE_DB -c CONTIGS_DB
[--gene-caller-ids GENE_CALLER_IDS]
[--genes-of-interest FILE]
[--samples-of-interest FILE]
[-o FILE_PATH] [--tab-delimited]
[--engine ENGINE] [--contigs-mode]
[--quince-mode] [--compress-samples]
Parameters
optional arguments:
--compress-samples Normally all samples with variation will have their
own consensus sequence. If this flag is provided, the
coverages from each sample of interest will be summed
and only a single consensus sequence for each
gene/contig will be output.
DATABASES: Declaring relevant anvi'o databases. First things first.
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
FOCUS: What do we want? A consensus sequence for a gene, or a list of genes. From where do we want it? All samples, by default. When do we want it? Whenever it is convenient.
--gene-caller-ids GENE_CALLER_IDS
Gene caller ids. Multiple of them can be declared
separated by a delimiter (the default is a comma). In
anvi-gen-variability-profile, if you declare nothing
you will get all genes matching your other filtering
criteria. In other programs, you may get everything,
nothing, or an error. It really depends on the
situation. Fortunately, mistakes are cheap, so it's
worth a try.
--genes-of-interest FILE
A file with anvi'o gene caller IDs. There should be
only one column in the file, and each line should
correspond to a unique gene caller id (without a
column header).
--samples-of-interest FILE
A file with sample names. There should be only one
column in the file, and each line should correspond to
a unique sample name (without a column header).
OUTPUT: Output file and output style
-o FILE_PATH, --output-file FILE_PATH
The output file name. The boring default is
"genes.fa". You can change the output file format to a
TAB-delimited file using the flag `--tab-delimited`,
in which case please do not forget to change the file
name, too.
--tab-delimited Use the TAB-delimited format for the output file.
EXTRAS: Parameters that will help you to do a very precise analysis. If you declare nothing from this bunch, you will get "everything" to play with, which is not necessarily a good thing…
--engine ENGINE Variability engine. The default is 'NT'.
--contigs-mode Use this flag to output consensus sequences of
contigs, instead of the default, which is genes
--quince-mode Use this flag to output consensus sequences for cases
even where there is no variability
A program to compute genes databases for a given set of bins stored in an
anvi'o collection. Genes databases store gene-level coverage and detection
statistics, and they are usually computed and generated automatically when
they are required (such as running anvi-interactive with --gene-mode
flag).
This program allows you to pre-compute them if you don't want them to be done
all at once.
Usage
anvi-gen-gene-level-stats-databases [-h] -c CONTIGS_DB -p PROFILE_DB
[-C COLLECTION_NAME] [-b BIN_NAME]
[-B FILE_PATH]
[--zeros-are-outliers]
[--outliers-threshold NUM]
[--just-do-it] [--inseq-stats]
Parameters
INPUT DATABASES: Which anvi'o databases do you wish to work with today?
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
BIN(S) AND COLLECTION: You can select a bin, multiple bins, or you can simply focus on every bin in a collection by providing only a collection name. Once you are done with your selection, anvi'o will generate an individual genes database for each of the bins it finds.
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.
-B FILE_PATH, --bin-ids-file FILE_PATH
Text file for bins (each line should be a unique bin
id).
ADDITIONAL PARAMETERS: These parameters are those that are critical to identify outlier nucleotide positions and how to define what should be included in those calculations. In most cases you can leave them as is, and things are going to be alright.
--zeros-are-outliers If you want all zero coverage positions to be treated
like outliers then use this flag. The reason to treat
zero coverage as outliers is because when mapping
reads to a reference we could get many zero positions
due to accessory genes. These positions then skew the
average values that we compute.
--outliers-threshold NUM
Threshold to use for the outlier detection. The
default value is '1.5'. Absolute deviation around the
median is used. To read more about the method please
refer to: 'How to Detect and Handle Outliers' by Boris
Iglewicz and David Hoaglin
(doi:10.1016/j.jesp.2013.03.013).
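To make "absolute deviation around the median" concrete, here is a minimal Python sketch of a median-based outlier test. It uses the generic modified z-score, including the 0.6745 scaling constant from the general method; whether anvi'o computes exactly this is an assumption, and the function name `flag_outliers` is hypothetical.

```python
import statistics


def flag_outliers(coverages, threshold=1.5, zeros_are_outliers=False):
    """Return one boolean per position, True where it looks like an outlier.

    Generic modified z-score sketch, not anvi'o's actual implementation.
    """
    median = statistics.median(coverages)
    abs_dev = [abs(c - median) for c in coverages]
    mad = statistics.median(abs_dev)  # median absolute deviation

    flags = []
    for cov, dev in zip(coverages, abs_dev):
        if mad == 0:
            is_outlier = False  # no spread around the median, nothing stands out
        else:
            is_outlier = 0.6745 * dev / mad > threshold
        if zeros_are_outliers and cov == 0:
            is_outlier = True  # mimic the idea behind --zeros-are-outliers
        flags.append(is_outlier)
    return flags


print(flag_outliers([10, 12, 11, 0, 13, 95, 12]))
print(flag_outliers([0, 0, 0, 5, 6, 0, 4], zeros_are_outliers=True))
```

In the second call most positions have zero coverage, so the median and the deviation are both zero and nothing would be flagged without treating zeros as outliers explicitly.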
PARAMETERS OF CONVENIENCE: They say they save lives.
--just-do-it Don't bother me with questions or warnings, just do
it.
INSEQ DATA: When analyzing INSeq/Tn-Seq data
--inseq-stats Provide if working with INSeq/Tn-Seq genomic data.
With this, all gene level coverage stats will be
calculated using INSeq/Tn-Seq statistical methods.
Create a genome storage from internal and/or external genomes for a pangenome analysis.
Example uses and other resources
Usage
anvi-gen-genomes-storage [-h] [-e FILE_PATH] [-i FILE_PATH]
[--gene-caller GENE-CALLER] -o GENOMES_STORAGE
Parameters
EXTERNAL GENOMES: External genomes listed as anvi'o contigs databases. As in, you have one or more genomes say from NCBI you want to work with, and you created an anvi'o contigs database for each one of them.
-e FILE_PATH, --external-genomes FILE_PATH
A two-column TAB-delimited flat text file that lists
anvi'o contigs databases. The first item in the header
line should read 'name', and the second should read
'contigs_db_path'. Each line in the file should
describe a single entry, where the first column is the
name of the genome (or MAG), and the second column is
the anvi'o contigs database generated for this genome.
INTERNAL GENOMES: Genome bins stored in anvi'o profile databases as collections.
-i FILE_PATH, --internal-genomes FILE_PATH
A five-column TAB-delimited flat text file. The header
line must contain these columns: 'name', 'bin_id',
'collection_id', 'profile_db_path', 'contigs_db_path'.
Each line should list a single entry, where 'name' can
be any name to describe the anvi'o bin identified as
'bin_id' that is stored in a collection.
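The two input files described above (external and internal genomes) are plain TAB-delimited tables, so they can be generated with a few lines of Python. The sketch below writes both layouts to in-memory buffers; the genome names, bin names, and database paths are placeholders invented for this example, and only the header column names come from the help text.

```python
import csv
import io

EXTERNAL_HEADER = ["name", "contigs_db_path"]
INTERNAL_HEADER = ["name", "bin_id", "collection_id",
                   "profile_db_path", "contigs_db_path"]


def write_genomes_table(handle, header, rows):
    """Write a TAB-delimited table with a header line and one row per genome."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


# Placeholder entries; replace with your own names and paths.
external = io.StringIO()
write_genomes_table(external, EXTERNAL_HEADER, [
    ["Genome_A", "Genome_A-CONTIGS.db"],
    ["Genome_B", "Genome_B-CONTIGS.db"],
])

internal = io.StringIO()
write_genomes_table(internal, INTERNAL_HEADER, [
    ["MAG_01", "Bin_1", "default", "PROFILE.db", "CONTIGS.db"],
])

print(external.getvalue(), end="")
print(internal.getvalue(), end="")
```

Swapping `io.StringIO()` for `open("external-genomes.txt", "w", newline="")` would write the same content to disk.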
PRO STUFF: Things you may not have to change. But you never know (unless you read the help).
--gene-caller GENE-CALLER
The gene caller to utilize. Anvi'o supports multiple
gene callers, and some operations (including this one)
require an explicit mention of which one to use.
The default is 'prodigal', but it will not be enough
if you were a rebel and have used
`--external-gene-callers` or something.
OUTPUT: Give it a nice name. Must end with '-GENOMES.db'. This is primarily due to the fact that there are other .db files used throughout anvi'o and it would be better to distinguish this very special file from them.
-o GENOMES_STORAGE, --output-file GENOMES_STORAGE
File path to store results.
Generate a Gephi network for functions based on non-normalized gene coverage values
Usage
anvi-gen-network [-h] -p PROFILE_DB -c CONTIGS_DB
[--annotation-source SOURCE NAME] [-l]
Parameters
optional arguments:
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
--annotation-source SOURCE NAME
Get functional annotations for a specific annotation
source. You can use the flag '--list-annotation-
sources' to learn about what sources are available.
-l, --list-annotation-sources
List available functional annotation sources.
Generate phylogenomic tree from alignment file.
Usage
anvi-gen-phylogenomic-tree [-h] -f FASTA -o FILE_PATH
[--program PROGRAM_NAME]
Parameters
INPUT FILES: Concatenated alignment files exported using anvi-get-sequences-for-gene-clusters
-f FASTA, --fasta-file FASTA
A FASTA-formatted input file
OUTPUT FILE: The output file where the generated newick tree will be stored.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
PROGRAM: The program that will be used for generating tree. Available options: default, fasttree
--program PROGRAM_NAME
Program name.
Identifies genes in your contigs database that encode proteins that are homologous to proteins with solved structures. If sufficiently similar homologs are identified, they are used as structural templates to predict the 3D structure of proteins in your contigs database.
Example uses and other resources
Usage
anvi-gen-structure-database [-h] -c CONTIGS_DB [--pdb-db PDB_DB]
[--genes-of-interest FILE]
[--gene-caller-ids GENE_CALLER_IDS]
[-o DB_FILE_PATH] [--dump-dir DUMP_DIR]
[--num-models NUM_MODELS]
[--deviation DEVIATION]
[--modeller-database MODELLER_DATABASE]
[--scoring-method SCORING_METHOD]
[--very-fast]
[--percent-identical-cutoff PERCENT_IDENTICAL_CUTOFF]
[--max-number-templates MAX_NUMBER_TEMPLATES]
[--skip-DSSP]
[--modeller-executable MODELLER_EXECUTABLE]
[--offline-mode] [-T NUM_THREADS]
[--write-buffer-size-per-thread INT]
Parameters
DATABASES: Declaring relevant anvi'o databases. First things first.
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
--pdb-db PDB_DB By default, this program accesses the structure files
it needs from an internal anvi'o database that can be
set up with anvi-setup-pdb-database. If a required
structure is not in this database, it will instead be
downloaded from the RCSB PDB server. This parameter
exists only if a) you created a database and b) it
exists in a custom location. In this case, please
provide that path here. Otherwise we vibing.
GENES: Specifying which genes you want to be modelled.
--genes-of-interest FILE
A file with anvi'o gene caller IDs. There should be
only one column in the file, and each line should
correspond to a unique gene caller id (without a
column header).
--gene-caller-ids GENE_CALLER_IDS
Gene caller ids. Multiple of them can be declared
separated by a delimiter (the default is a comma). In
anvi-gen-variability-profile, if you declare nothing
you will get all genes matching your other filtering
criteria. In other programs, you may get everything,
nothing, or an error. It really depends on the
situation. Fortunately, mistakes are cheap, so it's
worth a try.
OUTPUT: Output file and output style.
-o DB_FILE_PATH, --output-db-path DB_FILE_PATH
Output file path for the new database.
--dump-dir DUMP_DIR Modeling and annotating structures requires a lot of
moving parts, each of which has its own outputs. The
output of this program is a structure database
containing the pertinent results of this computation,
however a lot of stuff doesn't make the cut. By
providing a directory for this parameter you will get,
in addition to the structure database, a directory
containing the raw output for everything.
MODELLER PARAMS: Parameters for MODELLER's homology modeling.
--num-models NUM_MODELS, -N NUM_MODELS
This parameter determines the number of predicted
structures that are solved for a given protein. The
original atomic positions for each model are perturbed
by an amount defined by --deviation, which leads to
differences between each model. Therefore, whichever
of the N models is chosen to be the "best" model is
more likely to be accurate when --num-models is high,
since more of the solution space is searched. It
should be kept in mind that the largest determinant of
a model's accuracy is the set of protein templates
used, so there is no need to go overboard with an
excessively large --num-models. The default is 1.
--deviation DEVIATION, -d DEVIATION
Deviation (angstroms)
--modeller-database MODELLER_DATABASE, -D MODELLER_DATABASE
Which database do you want to search the structures
of? Default is "pdb_95". If you have your own database
it must have either the extension .bin or .pir. If you
don't have a database or don't know what this means,
don't worry, we will both inform you and take care of
you.
--scoring-method SCORING_METHOD, -b SCORING_METHOD
How should the best model be decided? The metric used
could be any of GA341_score, DOPE_score, and molpdf.
GA341 is an absolute measure, where a good model will
have a score near 1.0, whereas anything below 0.6 can
be considered bad. DOPE and molpdf scores are relative
energy measures, where lower scores are better. DOPE
has been generally shown to be a better distinguisher
between good and bad models than molpdf. By default,
DOPE is used. To learn more see the MODELLER tutorial:
https://salilab.org/modeller/tutorial/basic.html.
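The direction of each score matters when picking the "best" model: GA341 is better when higher, while DOPE and molpdf are better when lower. Here is a minimal Python sketch of that selection logic. The function name `pick_best_model`, the dictionary layout, and the score values are all made up for illustration; this is not how anvi'o or MODELLER store results.

```python
def pick_best_model(models, scoring_method="DOPE_score"):
    """Return the model that is best under the given scoring method."""
    higher_is_better = {"GA341_score"}
    lower_is_better = {"DOPE_score", "molpdf"}

    if scoring_method in higher_is_better:
        return max(models, key=lambda m: m[scoring_method])
    if scoring_method in lower_is_better:
        return min(models, key=lambda m: m[scoring_method])
    raise ValueError(f"Unknown scoring method: {scoring_method}")


# Hypothetical scores for two models of the same protein.
models = [
    {"name": "model_1", "GA341_score": 0.98, "DOPE_score": -41250.0, "molpdf": 1920.5},
    {"name": "model_2", "GA341_score": 0.99, "DOPE_score": -40810.0, "molpdf": 1875.1},
]

for method in ("DOPE_score", "molpdf", "GA341_score"):
    print(method, "->", pick_best_model(models, method)["name"])
```

Note that the three metrics do not have to agree with each other, which is why the choice of --scoring-method can change which model is kept.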
--very-fast If provided, a very fast optimization is done for each
model at the cost of accuracy. It is recommended to
use a --num-models of 1, since the optimization is so
crude that all models will likely converge to the same
solution.
--percent-identical-cutoff PERCENT_IDENTICAL_CUTOFF, -p PERCENT_IDENTICAL_CUTOFF
If a protein in the database has a proper percent
identity to the gene of interest that is greater than
or equal to --percent-identical-cutoff, then it is
used as a template. Otherwise it is not. Here we
define proper percent identity as the percentage of
amino acids in the gene of interest that are identical
to an entry in the database given the sequence length
of the gene of interest. For example, if there is 100%
identity between the gene of interest and the template
over the length of the alignment, but the alignment
length is only half of the gene of interest sequence
length, then the proper percent identical is 50%.
(This helps us avoid the inflation of identity scores
due to only partially good matches). The default is
30.
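The "proper percent identity" arithmetic described above can be sketched in a few lines of Python. The function names are hypothetical and the inputs (number of identical residues, length of the gene of interest) are assumed to be already known from an alignment.

```python
def proper_percent_identity(num_identical, gene_length):
    """Identical residues as a percentage of the full gene-of-interest length."""
    if gene_length <= 0:
        raise ValueError("gene_length must be positive")
    return 100.0 * num_identical / gene_length


def is_template(num_identical, gene_length, cutoff=30.0):
    """Apply the 'greater than or equal to the cutoff' rule from the help text."""
    return proper_percent_identity(num_identical, gene_length) >= cutoff


# A 200-residue gene aligned over only 100 residues at 100% identity:
print(proper_percent_identity(100, 200))
print(is_template(100, 200), is_template(50, 200))
```

The first print reproduces the example in the help text: a perfect match over half of the gene counts as 50%, not 100%.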
--max-number-templates MAX_NUMBER_TEMPLATES, -t MAX_NUMBER_TEMPLATES
Generally speaking it is best to use as many templates
as possible given that they have high proper percent
identity to the gene of interest. Taken from
https://salilab.org/modeller/methenz/andras/node4.html: 'The
use of several templates generally increases the model
accuracy. One strength of MODELLER is that it can
combine information from multiple template structures,
in two ways. First, multiple template structures may
be aligned with different domains of the target, with
little overlap between them, in which case the
modeling procedure can construct a homology-based
model of the whole target sequence. Second, the
template structures may be aligned with the same part
of the target, in which case the modeling procedure is
likely to automatically build the model on the locally
best template [43,44]. In general, it is frequently
beneficial to include in the modeling process all the
templates that differ substantially from each other,
if they share approximately the same overall
similarity to the target sequence.' The default is 5.
EXTRA: Everything else.
--skip-DSSP Dictionary of Secondary Structure of Proteins (DSSP)
is a program that takes as its input a protein
structure file and outputs predicted secondary
structure (alpha helix, beta strand, etc.), measures
of solvent accessibility, and hydrogen bonds for each
residue in the protein. If for some reason you don't
want this, provide this flag.
--modeller-executable MODELLER_EXECUTABLE
The MODELLER program to use. For example, `mod9.19`.
Anvi'o will try and find it if not provided
--offline-mode Anvi'o first tries to obtain template structures from
a database (see --pdb-db for details). If the
requested template does not exist in the database, its
structure will be downloaded from the RCSB PDB server.
However, if you don't have access to internet, or hate
the RCSB PDB, provide this flag so that all operations
of this program remain offline. If the template
structure is not in the database, then no template
structure for you.
-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on a SGE
--if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.
--write-buffer-size-per-thread INT
How many items should be kept in memory before they
are written to the disk. The default is 25 per thread.
So a single-threaded job would have a write buffer
size of 25, whereas a job with 4 threads would have a
write buffer size of 4*25. The larger the buffer size,
the less frequently the program will access the disk,
yet the more memory will be consumed since the
processed items will be cleared off the memory only
after they are written to the disk. The default buffer
size will likely work for most cases. Please keep an
eye on the memory usage output to make sure the memory
use never exceeds the size of the physical memory. If
--num-threads is 1, this parameter is ignored because
the DB is written to after each gene
Generate a variability matrix (potentially outdated program)
Usage
anvi-gen-variability-matrix [-h] -c CONTIGS_DB --splits-of-interest
FILE [--samples-of-interest FILE]
[--num-positions-from-each-split INT]
[-m INT] [-r RATIO] [-o FILE_PATH]
SUMMARY_DICT
Parameters
positional arguments:
SUMMARY_DICT Summary file
optional arguments:
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
--splits-of-interest FILE
A file with split names. There should be only one
column in the file, and each line should correspond to
a unique split name.
--samples-of-interest FILE
A file with samples names. There should be only one
column in the file, and each line should correspond to
a unique sample name (without a column header).
--num-positions-from-each-split INT
Each split may have one or more variable positions. By
default, anvi'o will report every SNV position found
in a given split. This parameter will help you to
define a cutoff for the maximum number of SNVs to be
reported from a split (if the number of SNVs is more
than the number you declare using this parameter, the
positions will be randomly subsampled).
-m INT, --min-scatter INT
This one is tricky. If you have N samples in your
dataset, a given variable position x in one of your
splits can split your N samples into `t` groups based
on the identity of the variation they harbor at
position x. For instance, `t` would have been 1, if
all samples had the same type of variation at position
x (which would not be very interesting, because in
this case position x would have zero contribution to a
deeper understanding of how these samples differ based
on variability). When `t` > 1, it would mean that
identities at position x across samples do differ. But
how much scattering occurs based on position x when t
> 1? If t=2, how many samples ended in each group?
Obviously, even distribution of samples across groups
may tell us something different than uneven
distribution of samples across groups. So, this
parameter filters out any x if 'the number of samples
in the second largest group' (=scatter) is less than
-m. Here is an example: let's assume you have 7
samples. While 5 of those have AG, 2 of them have TC
at position x. This would mean the scatter of x is 2.
If you set -m to 3, this position would not be reported
in your output matrix. The default value for -m is 0,
which means every `x` found in the database that
survived previous filtering criteria will be reported.
Naturally, -m cannot be more than half of the number
of samples. Please refer to the user documentation if
this is confusing.
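If the description of scatter is still confusing, this small Python sketch may help. It computes the number of samples in the second largest group for one position and applies the stated "less than -m" rule. The function names are hypothetical, and the list of per-sample identities is a toy input, not an anvi'o data structure.

```python
from collections import Counter


def scatter(identities):
    """Number of samples in the second largest group at one position.

    `identities` holds one entry per sample, e.g. 'AG' or 'TC'.
    Returns 0 when all samples fall into a single group.
    """
    groups = Counter(identities).most_common()  # sorted by group size, largest first
    if len(groups) < 2:
        return 0
    return groups[1][1]


def passes_min_scatter(identities, min_scatter=0):
    """Keep the position unless its scatter is less than min_scatter."""
    return scatter(identities) >= min_scatter


position_x = ["AG"] * 5 + ["TC"] * 2  # 7 samples, as in the example above
print(scatter(position_x))
print(passes_min_scatter(position_x, min_scatter=1))
print(passes_min_scatter(position_x, min_scatter=3))
```

With the default of 0, every position passes, since scatter can never be negative.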
-r RATIO, --min-ratio-of-competings-nts RATIO
Minimum ratio of the competing nucleotides at a given
position. Default is 0.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
A program to generate a network description from an anvi'o variability profile (potentially outdated program).
Usage
anvi-gen-variability-network [-h] -i VARIABILITY_PROFILE
[-n NUM_POSITIONS] [-o FILE_PATH]
Parameters
optional arguments:
-i VARIABILITY_PROFILE, --input-file VARIABILITY_PROFILE
The anvi'o variability profile. Please see `anvi-gen-
variability-profile` to generate one.
-n NUM_POSITIONS, --max-num-unique-positions NUM_POSITIONS
Maximum number of unique positions to be used in the
network. This may be one way to avoid extremely large
network descriptions that would defeat the purpose of
a quick visualization. If there are more unique
positions in the variability profile, the program will
randomly select a subset of them to match the `max-
num-unique-positions`. The default is 0, which means
all positions should be reported. Remember that the
number of nodes in the network will also depend on the
number of samples described in the variability
profile.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
Generate a table that comprehensively summarizes the variability of nucleotide, codon, or amino acid positions. We call these single nucleotide variants (SNVs), single codon variants (SCVs), and single amino acid variants (SAAVs), respectively. Learn more here: http://merenlab.org/2015/07/20/analyzing-variability/
Example uses and other resources
Usage
anvi-gen-variability-profile [-h] -p PROFILE_DB -c CONTIGS_DB
[-s STRUCTURE_DB] [-C COLLECTION_NAME]
[-b BIN_NAME] [--splits-of-interest FILE]
[--genes-of-interest FILE]
[--gene-caller-ids GENE_CALLER_IDS]
[--only-if-structure]
[--samples-of-interest FILE]
[--engine ENGINE] [--skip-synonymity]
[--num-positions-from-each-split INT]
[-r FLOAT] [-z FLOAT] [-j FLOAT]
[-a FLOAT] [-x NUM_SAMPLES]
[--min-coverage-in-each-sample INT]
[--quince-mode] [-o VARIABILITY_PROFILE]
[--include-contig-names]
[--include-split-names]
[--compute-gene-coverage-stats]
Parameters
DATABASES: Declaring relevant anvi'o databases. First things first. Some are mandatory, some are optional.
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-s STRUCTURE_DB, --structure-db STRUCTURE_DB
Anvi'o structure database.
FOCUS :: BIN: You need to pick something to focus on. You can ask anvi'o to work with a bin in a collection.
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.
FOCUS :: SPLIT NAMES: Alternatively you can declare split names to focus on.
--splits-of-interest FILE
A file with split names. There should be only one
column in the file, and each line should correspond to
a unique split name.
FOCUS :: GENE CALLER IDs: Alternatively you can declare gene caller IDs to focus on.
--genes-of-interest FILE
A file with anvi'o gene caller IDs. There should be
only one column in the file, and each line should
correspond to a unique gene caller id (without a
column header).
--gene-caller-ids GENE_CALLER_IDS
Gene caller ids. Multiple of them can be declared
separated by a delimiter (the default is a comma). In
anvi-gen-variability-profile, if you declare nothing
you will get all genes matching your other filtering
criteria. In other programs, you may get everything,
nothing, or an error. It really depends on the
situation. Fortunately, mistakes are cheap, so it's
worth a try.
--only-if-structure If provided, your genes of interest will be further
subset to only include genes with structures in your
structure database, and therefore must be supplied in
conjunction with a structure database, i.e. `-s
<your_structure_database>`. If you did not specify
genes of interest, ALL genes will be subset to those
that have structures.
SAMPLES: You can ask anvi'o to focus only on a subset of samples.
--samples-of-interest FILE
A file with samples names. There should be only one
column in the file, and each line should correspond to
a unique sample name (without a column header).
ENGINE: Set your engine. This is important as it will define the output profile you will get from this program. The engine can focus on nucleotides (NT), codons (CDN), or amino acids (AA).
--engine ENGINE Variability engine. The default is 'NT'.
--skip-synonymity Computing synonymity can be an expensive operation for
large data sets. Provide this flag to skip computing
synonymity. It only makes sense to provide this flag
when using --engine CDN.
FILTERS: Parameters that will help you to do a very precise analysis. If you declare nothing from this bunch, you will get "everything" to play with, which is not necessarily a good thing…
--num-positions-from-each-split INT
Each split may have one or more variable positions. By
default, anvi'o will report every SNV position found
in a given split. This parameter will help you to
define a cutoff for the maximum number of SNVs to be
reported from a split (if the number of SNVs is more
than the number you declare using this parameter, the
positions will be randomly subsampled).
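The cutoff behaves like a cap followed by random subsampling. A minimal Python sketch, assuming positions are simple integers and using a hypothetical `seed` argument so the result is reproducible (the anvi'o flag itself has no such option in this help text):

```python
import random


def subsample_positions(positions, max_positions=None, seed=None):
    """Return at most max_positions positions, chosen at random when over the cap."""
    positions = list(positions)
    if max_positions is None or len(positions) <= max_positions:
        return positions  # under the cap: report everything
    rng = random.Random(seed)
    return sorted(rng.sample(positions, max_positions))


split_positions = [12, 87, 153, 204, 377, 512]
print(subsample_positions(split_positions, max_positions=3, seed=1))
print(subsample_positions(split_positions, max_positions=10))
```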
-r FLOAT, --min-departure-from-reference FLOAT
Takes a value between 0 and 1, where 1 is maximum
divergence from the reference. Default is 0.000000.
The reference here is the observation that corresponds
to a given position in the mapped context.
-z FLOAT, --max-departure-from-reference FLOAT
Similar to '--min-departure-from-reference', but
defines an upper limit for divergence. The default is
1.000000.
-j FLOAT, --min-departure-from-consensus FLOAT
Takes a value between 0 and 1, where 1 is maximum
divergence from the consensus for a given position.
The default is 0.000000. The consensus is the most
frequent observation at a given position.
-a FLOAT, --max-departure-from-consensus FLOAT
Similar to '--min-departure-from-consensus', but
defines an upper limit for divergence. The default is
1.000000.
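To see how the two departure values differ, here is a minimal Python sketch that computes both from per-base counts at a single position. It assumes departure means "the fraction of observations that do not match" the reference or the consensus, which fits the 0 to 1 range described above but is an assumption about the exact formula; the function name and counts are made up.

```python
def departures(counts, reference):
    """Return (departure_from_reference, departure_from_consensus) for one position.

    `counts` maps each observed base to how many reads carried it.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("position has no coverage")
    consensus = max(counts, key=counts.get)  # most frequent observation
    from_reference = 1 - counts.get(reference, 0) / total
    from_consensus = 1 - counts[consensus] / total
    return from_reference, from_consensus


# The reference base is A, but most reads carry G at this position.
print(departures({"A": 25, "C": 0, "G": 75, "T": 0}, reference="A"))
```

Here the position departs strongly from the reference (0.75) while departing only modestly from its own consensus (0.25), which is why the two sets of filters can select very different positions.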
-x NUM_SAMPLES, --min-occurrence NUM_SAMPLES
Minimum number of samples in which a nucleotide position
should be reported as variable. Default is 1. If you set it
to 2, for instance, each eligible variable position
will be expected to appear in at least two samples,
which will reduce the impact of stochastic, or
unintelligible variable positions.
--min-coverage-in-each-sample INT
Minimum coverage of a given variable nucleotide
position in all samples. If a nucleotide position is
covered less than this value even in one sample, it
will be removed from the analysis. Default is 0.
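The two filters above (--min-occurrence and --min-coverage-in-each-sample) can be pictured with a toy table. This Python sketch is a simplification: it assumes each position maps to the coverage observed in the samples where it was reported as variable, and the function name and data layout are hypothetical rather than anvi'o's internal representation.

```python
def filter_positions(table, min_occurrence=1, min_coverage_in_each_sample=0):
    """Return positions that satisfy both filters.

    `table` maps position -> {sample_name: coverage}.
    """
    kept = []
    for position, coverage_by_sample in sorted(table.items()):
        if len(coverage_by_sample) < min_occurrence:
            continue  # reported as variable in too few samples
        if any(cov < min_coverage_in_each_sample
               for cov in coverage_by_sample.values()):
            continue  # under-covered in at least one sample
        kept.append(position)
    return kept


table = {
    101: {"S1": 40, "S2": 35},
    202: {"S1": 50},
    303: {"S1": 8, "S2": 60},
}
print(filter_positions(table, min_occurrence=2))
print(filter_positions(table, min_coverage_in_each_sample=10))
print(filter_positions(table, min_occurrence=2, min_coverage_in_each_sample=10))
```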
--quince-mode The default behavior is to report base frequencies of
nucleotide positions only if there is any variation
reported during profiling (which by default uses some
heuristics to minimize the impact of error-driven
variation). So, if there are 10 samples, and a given
position has been reported as a variable site during
profiling in only one of those samples, no
information will be stored in the database for the
remaining 9. When this flag is used, we go back to
each sample, and report base frequencies for each
sample at this position even if they do not vary. It
will take considerably longer to report when this flag
is on, and the use of it will increase the file size
dramatically, however it is inevitable for some
statistical approaches (as well as for some beautiful
visualizations).
OUTPUT: Output file and style
-o VARIABILITY_PROFILE, --output-file VARIABILITY_PROFILE
File path to store results.
--include-contig-names
Use this flag if you would like contig names for each
variable position to be included in the output file as
a column. By default, we do not include contig names
since they can practically double the output file size
without any actual benefit in most cases.
--include-split-names
Use this flag if you would like split names for each
variable position to be included in the output file as
a column.
--compute-gene-coverage-stats
If provided, gene coverage statistics will be appended
for each entry in variability report. This is very
useful information, but will not be included by
default because it is an expensive operation, and may
take some additional time.
Fetches the number of times each amino acid occurs from a contigs database in a given bin, set of contigs, or set of genes
Usage
anvi-get-aa-counts [-h] -c CONTIGS_DB [-o FILE_PATH] [-p PROFILE_DB]
[-C COLLECTION_NAME] [-B FILE_PATH]
[--contigs-of-interest FILE]
[--gene-caller-ids GENE_CALLER_IDS]
Parameters
MANDATORY STUFF: You have to set the following two parameters, then you will select one set of parameters from the following optional sections. If you select nothing from those sets, AA counts for everything in the contigs database will be reported.
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
OPTIONAL PARAMS FOR BINS:
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-B FILE_PATH, --bin-ids-file FILE_PATH
Text file for bins (each line should be a unique bin
id).
OPTIONAL PARAMS FOR CONTIGS:
--contigs-of-interest FILE
A file with contig names. There should be only one
column in the file, and each line should correspond to
a unique contig name.
OPTIONAL PARAMS FOR GENE CALLS:
--gene-caller-ids GENE_CALLER_IDS
Gene caller ids. Multiple of them can be declared
separated by a delimiter (the default is a comma). In
anvi-gen-variability-profile, if you declare nothing
you will get all genes matching your other filtering
criteria. In other programs, you may get everything,
nothing, or an error. It really depends on the
situation. Fortunately, mistakes are cheap, so it's
worth a try.
Get amino acid or codon frequencies of genes in a contigs database.
Usage
anvi-get-codon-frequencies [-h] -c CONTIGS_DB
[--gene-caller-id GENE_CALLER_ID]
[--return-AA-frequencies-instead] -o
FILE_PATH [--percent-normalize]
[--merens-codon-normalization]
Parameters
INPUT DATABASE: The contigs database. Clearly those genes must be read from somewhere.
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
OPTIONALS: Important things to read never end. Stupid science.
--gene-caller-id GENE_CALLER_ID
OK. You can declare a single gene caller ID if you
wish, in which case anvi'o would only return results
for a single gene call. If you don't declare anything,
well, you must be prepared to brace yourself if you
are working with a very large contigs database with
hundreds of thousands of genes.
--return-AA-frequencies-instead
By default, anvi'o will return codon frequencies (as
the name suggests), but you can ask for amino acid
frequencies instead, simply because you always need
more data and more stuff. You're lucky this time, but
is there an end to this? Will you ever be satisfied
with what you have? Anvi'o needs answers.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
--percent-normalize Instead of actual counts, report percent-normalized
frequencies per gene (because you are too lazy to do
things the proper way in R).
--merens-codon-normalization
This is a flag to percent normalize codon frequencies
within those that encode for the same amino acid. It
is different from the flag --percent-normalize, since
it does not percent normalize frequencies of codons
within a gene based on all codon frequencies. Clearly
this flag is not applicable if you wish to work with
boring amino acids. WHO WORKS WITH AMINO ACIDS
ANYWAYS.
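The difference between the two normalizations is easiest to see on a toy gene. The Python sketch below normalizes codon counts once across all codons in the gene and once within each group of synonymous codons. The codon table is deliberately partial (two amino acids only), and the function names and counts are invented for illustration; this is not anvi'o's code.

```python
from collections import defaultdict

# Partial codon table, just enough for this example.
CODON_TO_AA = {"GCT": "Ala", "GCC": "Ala", "AAA": "Lys", "AAG": "Lys"}


def percent_normalize(counts):
    """Percent of each codon relative to all codons in the gene."""
    total = sum(counts.values())
    return {codon: 100.0 * n / total if total else 0.0
            for codon, n in counts.items()}


def synonymous_normalize(counts):
    """Percent of each codon relative to codons encoding the same amino acid."""
    aa_totals = defaultdict(int)
    for codon, n in counts.items():
        aa_totals[CODON_TO_AA[codon]] += n
    result = {}
    for codon, n in counts.items():
        aa_total = aa_totals[CODON_TO_AA[codon]]
        result[codon] = 100.0 * n / aa_total if aa_total else 0.0
    return result


gene_counts = {"GCT": 6, "GCC": 2, "AAA": 1, "AAG": 1}
print(percent_normalize(gene_counts))
print(synonymous_normalize(gene_counts))
```

In the first output the values sum to 100 across the whole gene; in the second they sum to 100 within each amino acid, so a rare amino acid with balanced codon usage still shows 50/50.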
A program that takes a pangenome, and a categorical layers additional data item, and generates the input for anvi-get-enriched-functions-per-pan-group. If requested a functional occurrence table across genomes is also generated.
pangenomics
functions
Example uses and other resources
Usage
anvi-get-enriched-functions-per-pan-group [-h] -p PAN_DB
[-g GENOMES_STORAGE]
[--category-variable CATEGORY]
[--annotation-source SOURCE NAME]
[-l]
[--include-gc-identity-as-function]
-o FILE_PATH [-F FILE]
[--exclude-ungrouped]
[--just-do-it]
Parameters
INPUT FILES: Input files from the pangenome analysis.
-p PAN_DB, --pan-db PAN_DB
Anvi'o pan database
-g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
Anvi'o genomes storage file
CATEGORY VARIABLE AND FUNCTIONAL ANNOTATION SOURCE: This is the layers additional data item in which your genomes are split into multiple groups. So anvi'o can figure out what functions are specific to each group of genomes in your pangenomic analysis. If this is not making any sense, please take a look at the online tutorial for pangenomics (http://merenlab.org/2016/11/08/pangenomics-v2/).
--category-variable CATEGORY
The additional layers data variable name that divides
layers into multiple categories.
--annotation-source SOURCE NAME
Get functional annotations for a specific annotation
source. You can use the flag '--list-annotation-
sources' to learn about what sources are available.
-l, --list-annotation-sources
List available functional annotation sources.
--include-gc-identity-as-function
This is an option that asks anvi'o to treat gene
cluster names as functions. By doing so, you are in
fact creating an opportunity to study functional
enrichment statistics for each gene cluster
independently. For instance, multiple gene clusters
may have the same COG function. But if you wish to use
the same enrichment analysis in your pangenome without
collapsing multiple gene clusters into a single
function name, you can use this flag, and ask for
'IDENTITY' as the functional annotation source.
REPORTING: Output and stuff.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
-F FILE, --functional-occurrence-table-output FILE
Saves the occurrence frequency information for
functions in genomes in a TAB-delimited format. A file
name must be provided. To learn more about how the
functional occurrence is computed, please refer to the
tutorial.
OPTIONAL PARAMETERS: Parameters to help you filter the output.
--exclude-ungrouped Use this flag if you want anvi'o to ignore genomes
with no value set for the category variable (which
you specified using --category-variable). By default
all variables with no value will be considered as a
single group when performing the statistical analysis.
MORE OPTIONAL THINGS: Parameters that are there for you to help you in any way they can.
--just-do-it Don't bother me with questions or warnings, just do
it.
A script to get back sequences for gene calls
Usage
anvi-get-sequences-for-gene-calls [-h] [-c CONTIGS_DB]
[--gene-caller-ids GENE_CALLER_IDS]
[--delimiter CHAR]
[--report-extended-deflines]
[--wrap WRAP] [--export-gff3]
[--get-aa-sequences]
[-g GENOMES_STORAGE]
[-G GENOME_NAMES] -o FILE_PATH
Parameters
OPTION #1: EXPORT FROM CONTIGS DB:
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
--gene-caller-ids GENE_CALLER_IDS
Gene caller ids. Multiple of them can be declared
separated by a delimiter (the default is a comma). In
anvi-gen-variability-profile, if you declare nothing
you will get all genes matching your other filtering
criteria. In other programs, you may get everything,
nothing, or an error. It really depends on the
situation. Fortunately, mistakes are cheap, so it's
worth a try.
--delimiter CHAR The delimiter to parse multiple input terms. The
default is ','.
--report-extended-deflines
When declared, the deflines in the resulting FASTA
file will contain more information.
--wrap WRAP When to wrap sequences when storing them in a FASTA
file. The default is '120'. A value of '0' would be
equivalent to 'do not wrap'.
--export-gff3 If this is true, the output file will be in GFF3
format.
--get-aa-sequences Store amino acid sequences instead.
OPTION #2: EXPORT FROM A GENOMES STORAGE:
-g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
Anvi'o genomes storage file
-G GENOME_NAMES, --genome-names GENOME_NAMES
Genome names to 'focus'. You can use this parameter to
limit the genomes included in your analysis. You can
provide these names as a comma-separated list of
names, or you can put them in a file, where you have a
single genome name in each line, and provide the file
path.
OPTIONS COMMON TO ALL INPUTS:
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
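The `--wrap` behavior described above (wrap at 120 characters by default, with '0' meaning 'do not wrap') can be illustrated with a minimal Python sketch. This is not anvi'o's own code; the function names `wrap_sequence` and `fasta_entry` and the defline `gene_1` are made up for illustration.

```python
def wrap_sequence(sequence: str, wrap: int = 120) -> str:
    """Break a sequence into lines of at most `wrap` characters.

    A value of 0 leaves the sequence on a single line, mirroring the
    'do not wrap' meaning described for --wrap.
    """
    if wrap < 0:
        raise ValueError("wrap must be zero or a positive integer")
    if wrap == 0:
        return sequence
    return "\n".join(sequence[i:i + wrap] for i in range(0, len(sequence), wrap))


def fasta_entry(defline: str, sequence: str, wrap: int = 120) -> str:
    """Format one FASTA record (hypothetical helper, not part of anvi'o)."""
    return f">{defline}\n{wrap_sequence(sequence, wrap)}\n"


# A 280-nucleotide toy sequence becomes lines of 120, 120 and 40 characters.
seq = "ACGT" * 70
wrapped_lengths = [len(line) for line in wrap_sequence(seq).split("\n")]
entry = fasta_entry("gene_1", seq, wrap=0)
```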
Do cool stuff with gene clusters in anvi'o pan genomes
Usage
anvi-get-sequences-for-gene-clusters [-h] -p PAN_DB
[-g GENOMES_STORAGE] [-o FASTA]
[--report-DNA-sequences]
[--gene-cluster-id GENE_CLUSTER_ID]
[--gene-cluster-ids-file FILE_PATH]
[-C COLLECTION_NAME] [-b BIN_NAME]
[--min-num-genomes-gene-cluster-occurs INTEGER]
[--max-num-genomes-gene-cluster-occurs INTEGER]
[--min-num-genes-from-each-genome INTEGER]
[--max-num-genes-from-each-genome INTEGER]
[--max-num-gene-clusters-missing-from-genome INTEGER]
[--min-functional-homogeneity-index FLOAT]
[--max-functional-homogeneity-index FLOAT]
[--min-geometric-homogeneity-index FLOAT]
[--max-geometric-homogeneity-index FLOAT]
[--min-combined-homogeneity-index FLOAT]
[--max-combined-homogeneity-index FLOAT]
[--add-into-items-additional-data-table NAME]
[--list-collections] [--list-bins]
[--concatenate-gene-clusters]
[--partition-file FILE_PATH]
[--separator STRING]
[--align-with ALIGNER]
[--list-aligners] [--just-do-it]
[--dry-run]
Parameters
INPUT FILES: Input files from the pangenome analysis.
-p PAN_DB, --pan-db PAN_DB
Anvi'o pan database
-g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
Anvi'o genomes storage file
OUTPUT: You get to choose an output file name to report things. The default will be an ugly name. So, be explicit.
-o FASTA, --output-file FASTA
File path to store results.
--report-DNA-sequences
By default, this program reports amino acid sequences.
Use this flag to report DNA sequences instead.
SELECTION: Which gene clusters should be reported. You can ask for a single gene cluster, or multiple ones listed in a file, or you can use a collection and bin name to list gene clusters of interest. If you give nothing, this program will export alignments for every single gene cluster found in the pan database (and this is called 'customer service').
--gene-cluster-id GENE_CLUSTER_ID
Gene cluster ID you are interested in.
--gene-cluster-ids-file FILE_PATH
Text file for gene clusters (each line should be a
unique gene cluster id).
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.
ADVANCED FILTERS: If you are here you must be looking for ways to specify exactly what you want from that database of gene clusters. These filters will be applied to what your previous selections reported.
--min-num-genomes-gene-cluster-occurs INTEGER
This filter will remove gene clusters from your
report. Let's assume you have 100 genomes in your pan
genome analysis. You can use this parameter if you
want to work only with gene clusters that occur in at
least X number of genomes. If you say '--min-num-
genomes-gene-cluster-occurs 90', each gene cluster in
the analysis will be required at least to appear in 90
genomes. If a gene occurs in less than that number of
genomes, it simply will not be reported. This is
especially useful for phylogenomic analyses, where you
may want to only focus on gene clusters that are
prevalent across the set of genomes you wish to
analyze.
--max-num-genomes-gene-cluster-occurs INTEGER
This filter will remove gene clusters from your
report. Let's assume you have 100 genomes in your pan
genome analysis. You can use this parameter if you
want to work only with gene clusters that occur in at
most X number of genomes. If you say '--max-num-
genomes-gene-cluster-occurs 1', you will get gene
clusters that are singletons. Combining this parameter
with --min-num-genomes-gene-cluster-occurs can give
you a very precise way to filter your gene clusters.
--min-num-genes-from-each-genome INTEGER
This filter will remove gene clusters from your
report. If you say '--min-num-genes-from-each-genome
2', this filter will remove every gene cluster, to
which every genome in your analysis contributed less
than 2 genes. This can be useful to find out gene
clusters with many genes from many genomes (such as
conserved multi-copy genes within a clade).
--max-num-genes-from-each-genome INTEGER
This filter will remove gene clusters from your
report. If you say '--max-num-genes-from-each-genome
1', every gene cluster that has more than one gene
from any genome that contributes to it will be removed
from your analysis. This could be useful to remove
gene clusters with paralogs from your report for
appropriate phylogenomic analyses. For instance, using
'--max-num-genes-from-each-genome 1' and '--min-num-
genomes-gene-cluster-occurs X' where X is the total
number of your genomes, would give you the single-copy
gene clusters in your pan genome.
--max-num-gene-clusters-missing-from-genome INTEGER
This filter will remove genomes from your report. If
you have a list of gene cluster names, you can use
this parameter to omit any genome from your report if
it is missing more than a number of genes you desire.
For instance, if you have 100 genomes in your pan
genome, and you are interested in working only with
genomes that have all 5 specific gene clusters of your
choice, you can use '--max-num-gene-clusters-missing-
from-genome 4' to remove the genomes that are
missing more than 4 of those 5 genes. This is
especially useful for phylogenomic analyses. Parameter
0 will remove any genome that is missing any of the
genes.
--min-functional-homogeneity-index FLOAT
This filter will remove gene clusters from your
report. If you say '--min-functional-homogeneity-index
0.3', every gene cluster with a functional homogeneity
index less than 0.3 will be removed from your
analysis. This can be useful if you only want to look
at gene clusters that are highly conserved in
resulting function
--max-functional-homogeneity-index FLOAT
This filter will remove gene clusters from your
report. If you say '--max-functional-homogeneity-index
0.5', every gene cluster with a functional homogeneity
index greater than 0.5 will be removed from your
analysis. This can be useful if you only want to look
at gene clusters that don't seem to be functionally
conserved
--min-geometric-homogeneity-index FLOAT
This filter will remove gene clusters from your
report. If you say '--min-geometric-homogeneity-index
0.3', every gene cluster with a geometric homogeneity
index less than 0.3 will be removed from your
analysis. This can be useful if you only want to look
at gene clusters that are highly conserved in
geometric configuration
--max-geometric-homogeneity-index FLOAT
This filter will remove gene clusters from your
report. If you say '--max-geometric-homogeneity-index
0.5', every gene cluster with a geometric homogeneity
index greater than 0.5 will be removed from your
analysis. This can be useful if you only want to look
at gene clusters that may not be as conserved as
others
--min-combined-homogeneity-index FLOAT
This filter will remove gene clusters from your
report. If you say '--min-combined-homogeneity-index
0.3', every gene cluster with a combined homogeneity
index less than 0.3 will be removed from your
analysis. This can be useful if you only want to look
at gene clusters that are highly conserved overall
--max-combined-homogeneity-index FLOAT
This filter will remove gene clusters from your
report. If you say '--max-combined-homogeneity-index
0.5', every gene cluster with a combined homogeneity
index greater than 0.5 will be removed from your
analysis. This can be useful if you only want to look
at gene clusters that may not be as conserved
overall as others
--add-into-items-additional-data-table NAME
If you use any of the filters, and would like to add
the resulting item names into the items additional
data table of your database, you can use this
parameter. You will need to give a name for these
results to be saved. If the given name is already in
the items additional data table, its contents will be
replaced with the new one. Then you can run anvi-
interactive or anvi-display-pan to 'see' the results
of your filters.
OTHER STUFF: Yes. Stuff that is not like the ones above.
--list-collections Show available collections and exit.
--list-bins List available bins in a collection and exit.
PHYLOGENOMICS: Get separately aligned and concatenated sequences for phylogenomics.
--concatenate-gene-clusters
Concatenate output gene clusters in the same order to
create a multi-gene alignment output that is suitable
for phylogenomic analyses.
--partition-file FILE_PATH
Some commonly used software for phylogenetic analyses
(e.g., IQ-TREE, RAxML, etc) allow users to
specify/test different substitution models for each
gene of a concatenated multiple sequence alignment.
For this, they use a special file format called a
'partition file', which indicates the site for each
gene in the alignment. You can use this parameter to
declare an output path for anvi'o to report a NEXUS
format partition file in addition to your FASTA output
(requested by Massimiliano Molari in #1333).
--separator STRING Characters to separate things (the default is whatever
is most suitable).
--align-with ALIGNER The multiple sequence alignment program to use when
multiple sequence alignment is necessary. To see all
available options, use the flag `--list-aligners`.
--list-aligners Show available software for multiple sequence
alignment.
LIFE SAVERS: Just when you need them.
--just-do-it Don't bother me with questions or warnings, just do
it.
--dry-run Don't do anything real. Test everything, and stop
right before wherever the developer said 'well, this
is enough testing', and decided to print out results.
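The way the occurrence filters above combine can be sketched in Python. This is a simplified illustration under stated assumptions, not anvi'o's implementation: a pangenome is modeled as a hypothetical dictionary mapping each gene cluster name to the number of genes each genome contributes, and only `--min-num-genomes-gene-cluster-occurs`, `--max-num-genomes-gene-cluster-occurs`, and `--max-num-genes-from-each-genome` are modeled.

```python
from typing import Dict, Optional

# Hypothetical toy model: gene cluster name -> {genome name -> number of genes}
GeneClusters = Dict[str, Dict[str, int]]


def filter_gene_clusters(
    clusters: GeneClusters,
    min_num_genomes: Optional[int] = None,
    max_num_genomes: Optional[int] = None,
    max_genes_per_genome: Optional[int] = None,
) -> GeneClusters:
    """Keep only the gene clusters that pass every filter that was set."""
    kept = {}
    for name, counts in clusters.items():
        # Genomes that actually contribute at least one gene to this cluster.
        present = {genome: n for genome, n in counts.items() if n > 0}
        if min_num_genomes is not None and len(present) < min_num_genomes:
            continue
        if max_num_genomes is not None and len(present) > max_num_genomes:
            continue
        if max_genes_per_genome is not None and any(
            n > max_genes_per_genome for n in present.values()
        ):
            continue
        kept[name] = counts
    return kept


pan = {
    "GC_1": {"A": 1, "B": 1, "C": 1},  # present once in every genome
    "GC_2": {"A": 2, "B": 1, "C": 1},  # contains a paralog in genome A
    "GC_3": {"A": 1},                  # singleton
}

# Single-copy core: at most 1 gene per genome, present in all 3 genomes.
single_copy_core = filter_gene_clusters(pan, min_num_genomes=3, max_genes_per_genome=1)
singletons = filter_gene_clusters(pan, max_num_genomes=1)
```

With three genomes, asking for at most one gene per genome and occurrence in all three genomes leaves only `GC_1`, which matches the single-copy gene cluster recipe described for `--max-num-genes-from-each-genome`.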
Get sequences for HMM hits from many inputs.
Example uses and other resources
Usage
anvi-get-sequences-for-hmm-hits [-h] [-c CONTIGS_DB] [-p PROFILE_DB]
[-C COLLECTION_NAME] [-b BIN_NAME]
[-B FILE_PATH] [-e FILE_PATH]
[-i FILE_PATH]
[--hmm-sources SOURCE NAME]
[--gene-names HMM HIT NAME] [-l] [-L]
[-o FILE_PATH] [--no-wrap]
[--get-aa-sequences]
[--concatenate-genes]
[--partition-file FILE_PATH]
[--max-num-genes-missing-from-bin INTEGER]
[--min-num-bins-gene-occurs INTEGER]
[--align-with ALIGNER]
[--separator STRING]
[--return-best-hit] [--just-do-it]
Parameters
INPUT OPTION #1: CONTIGS DB: There are multiple ways to access sequences. Your first option is to provide a contigs database, and call it a day. In this case the program will return you everything from it.
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
INPUT OPTION #2: CONTIGS DB + PROFILE DB: You can also work with anvi'o profile databases and collections stored in them. If you go this way, you still will need to provide a contigs database. If you just specify a collection name, you will get hits from every bin in it. You can also use the bin name or bin ids file parameters to specify your interest more precisely.
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.
-B FILE_PATH, --bin-ids-file FILE_PATH
Text file for bins (each line should be a unique bin
id).
INPUT OPTION #3: INT/EXTERNAL GENOMES FILE: Yes. You can alternatively use as input an internal or external genomes file, or both of them together. If you have multiple contigs databases without any profile database, you can use the external genomes file. So if you just have a bunch of FASTA files and nothing else, this is what you need. In contrast, if you want to access genes in bins described in collections stored in anvi'o profile databases, then you can use the internal genomes file route. Or you can mix the two, because why not. There is not much room for excuses here.
-e FILE_PATH, --external-genomes FILE_PATH
A two-column TAB-delimited flat text file that lists
anvi'o contigs databases. The first item in the header
line should read 'name', and the second should read
'contigs_db_path'. Each line in the file should
describe a single entry, where the first column is the
name of the genome (or MAG), and the second column is
the anvi'o contigs database generated for this genome.
-i FILE_PATH, --internal-genomes FILE_PATH
A five-column TAB-delimited flat text file. The header
line must contain these columns: 'name', 'bin_id',
'collection_id', 'profile_db_path', 'contigs_db_path'.
Each line should list a single entry, where 'name' can
be any name to describe the anvi'o bin identified as
'bin_id' that is stored in a collection.
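The two file layouts described above can be generated with a short Python sketch. The column names come from the help text; the genome names, bin names, collection name, and database paths below are hypothetical placeholders.

```python
import csv
import io

EXTERNAL_COLUMNS = ["name", "contigs_db_path"]
INTERNAL_COLUMNS = ["name", "bin_id", "collection_id", "profile_db_path", "contigs_db_path"]


def genomes_table(columns, rows) -> str:
    """Render a TAB-delimited table with a header line and one entry per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        if len(row) != len(columns):
            raise ValueError(f"expected {len(columns)} columns, got {len(row)}")
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical entries, for illustration only.
external = genomes_table(
    EXTERNAL_COLUMNS,
    [("Genome_A", "Genome_A.db"), ("Genome_B", "Genome_B.db")],
)
internal = genomes_table(
    INTERNAL_COLUMNS,
    [("MAG_1", "Bin_1", "my_collection", "PROFILE.db", "CONTIGS.db")],
)
```

Writing `external` and `internal` to two text files would give inputs of the shape expected by `--external-genomes` and `--internal-genomes`, respectively.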
HMM STUFF: This is where you can specify an HMM source, and/or a list of genes to filter your results.
--hmm-sources SOURCE NAME
Get sequences for a specific list of HMM sources. You
can list one or more sources by separating them from
each other with a comma character (i.e., '--hmm-
sources source_1,source_2,source_3'). If you would
like to see a list of available sources in the contigs
database, run this program with '--list-hmm-sources'
flag.
--gene-names HMM HIT NAME
Get sequences only for a specific gene name. Each name
should be separated from each other by a comma
character. For instance, if you want to get back only
RecA and Ribosomal_L27, you can type '--gene-names
RecA,Ribosomal_L27', and you will get any and every
hit that matches these names in any source. If you
would like to see a list of available gene names, you
can use '--list-available-gene-names' flag.
-l, --list-hmm-sources
List available HMM sources in the contigs database and
quit.
-L, --list-available-gene-names
List available gene names in HMM sources selection and
quit.
THE OUTPUT: Where should the output go. It will be a FASTA file, and you better give it a nice name.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
--no-wrap Do not wrap sequences nicely in the output file.
THE ALPHABET: The sequences are reported in DNA alphabet, but you can also get them translated just like all the other cool kids.
--get-aa-sequences Store amino acid sequences instead.
PHYLOGENOMICS? K!: If you want, you can get your sequences concatenated. In this case anvi'o will use muscle to align every homolog, and concatenate them in the order you specified using the gene-names argument. Each concatenated sequence will be separated from the other ones by the separator.
--concatenate-genes Concatenate output genes in the same order to create a
multi-gene alignment output that is suitable for
phylogenomic analyses.
--partition-file FILE_PATH
Some commonly used software for phylogenetic analyses
(e.g., IQ-TREE, RAxML, etc) allow users to
specify/test different substitution models for each
gene of a concatenated multiple sequence alignment.
For this, they use a special file format called a
'partition file', which indicates the site for each
gene in the alignment. You can use this parameter to
declare an output path for anvi'o to report a NEXUS
format partition file in addition to your FASTA output
(requested by Massimiliano Molari in #1333).
--max-num-genes-missing-from-bin INTEGER
This filter removes bins (or genomes) from your
analysis. If you have a list of gene names, you can
use this parameter to omit any bin (or external
genome) that is missing more than a number of genes
you desire. For instance, if you have 100 genome bins,
and you are interested in working with 5 ribosomal
proteins, you can use '--max-num-genes-missing-from-
bin 4' to remove the bins that are missing more than 4
of those 5 genes. This is especially useful for
phylogenomic analyses. Parameter 0 will remove any bin
that is missing any of the genes.
--min-num-bins-gene-occurs INTEGER
This filter removes genes from your analysis. Let's
assume you have 100 bins to get sequences for HMM
hits. If you want to work only with genes among all
the hits that occur in at least X number of bins, and
discard the rest of them, you can use this flag. If
you say '--min-num-bins-gene-occurs 90', each gene in
the analysis will be required at least to appear in 90
genomes. If a gene occurs in less than that number of
genomes, it simply will not be reported. This is
especially useful for phylogenomic analyses, where you
may want to only focus on genes that are prevalent
across the set of genomes you wish to analyze.
--align-with ALIGNER The multiple sequence alignment program to use when
multiple sequence alignment is necessary. To see all
available options, use the flag `--list-aligners`.
--separator STRING A word that will be used to separate concatenated gene
sequences from each other (IF you are using this
program with `--concatenate-genes` flag). The default
is "XXX" for amino acid sequences, and "NNN" for DNA
sequences
OPTIONAL: Everything is optional, but some options are more optional than others.
--return-best-hit A bin may contain more than one hit for a gene name in
a given HMM source. For instance, there may be
multiple RecA hits in a genome bin from Campbell et
al. With this flag, anvi'o will go through all of the gene
names that appear multiple times, and remove all but
the one with the lowest e-value. Good for whenever you
really need to get only a single copy of single-copy
core genes from a genome bin.
--just-do-it Don't bother me with questions or warnings, just do
it.
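The logic behind `--return-best-hit` and `--max-num-genes-missing-from-bin` can be sketched in Python as follows. This is a toy model under stated assumptions rather than anvi'o's code: HMM hits are represented by a hypothetical `Hit` tuple, and the bin names and e-values are made up.

```python
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple


class Hit(NamedTuple):
    """Hypothetical record of one HMM hit (not an anvi'o data structure)."""
    bin_name: str
    gene_name: str
    e_value: float


def best_hits(hits: Iterable[Hit]) -> List[Hit]:
    """For each (bin, gene name) pair keep only the hit with the lowest e-value."""
    best: Dict[Tuple[str, str], Hit] = {}
    for hit in hits:
        key = (hit.bin_name, hit.gene_name)
        if key not in best or hit.e_value < best[key].e_value:
            best[key] = hit
    return list(best.values())


def bins_to_keep(hits: Iterable[Hit], gene_names: Set[str], max_missing: int) -> Set[str]:
    """Return bins that miss at most `max_missing` of the requested genes.

    Bins with no hit at all never appear in `hits`, so they are not returned.
    """
    found: Dict[str, Set[str]] = {}
    for hit in hits:
        if hit.gene_name in gene_names:
            found.setdefault(hit.bin_name, set()).add(hit.gene_name)
    return {b for b, genes in found.items() if len(gene_names) - len(genes) <= max_missing}


hits = [
    Hit("bin_1", "RecA", 1e-50),
    Hit("bin_1", "RecA", 1e-80),  # second RecA copy with a better e-value
    Hit("bin_1", "Ribosomal_L27", 1e-20),
    Hit("bin_2", "RecA", 1e-30),
]
best = best_hits(hits)
strict = bins_to_keep(best, {"RecA", "Ribosomal_L27"}, max_missing=0)
lenient = bins_to_keep(best, {"RecA", "Ribosomal_L27"}, max_missing=1)
```

In this toy data, `bin_2` lacks Ribosomal_L27, so it survives only when one missing gene is tolerated.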
Get short reads back from a BAM file with options for compression, splitting of forward and reverse reads, etc.
Usage
anvi-get-short-reads-from-bam [-h] -p PROFILE_DB -c CONTIGS_DB
[-C COLLECTION_NAME] [-b BIN_NAME]
[-B FILE_PATH] [-o FILE_PATH]
[-O FILENAME_PREFIX] [-X] [-Q]
BAM FILE[S] [BAM FILE[S] ...]
Parameters
positional arguments:
BAM FILE[S] BAM file(s) to access to recover short reads
optional arguments:
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.
-B FILE_PATH, --bin-ids-file FILE_PATH
Text file for bins (each line should be a unique bin
id).
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
-O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
A prefix to be used while naming the output files (no
file type extensions please; just a prefix).
-X, --gzip-output When declared, output file(s) will be gzip compressed
and the extension `.gz` will be added.
-Q, --split-R1-and-R2
When declared, this program outputs 3 FASTA files for
paired-end reads: one for R1, one for R2, and one for
unpaired reads.
Recover short reads from BAM files that were mapped to genes you are interested in. It is possible to work with a single gene call, or a bunch of them. Similarly, you can get short reads from a single BAM file, or from many of them.
metagenomics
profile_db
contigs_db
bam
variability
clustering
Usage
anvi-get-short-reads-mapping-to-a-gene [-h] -c CONTIGS_DB -i
INPUT_BAM(S) [INPUT_BAM(S) ...]
[--gene-caller-id GENE_CALLER_ID]
[--genes-of-interest FILE]
[--leeway LEEWAY_NTs]
[-O FILENAME_PREFIX]
Parameters
INPUT FILES: An anvi'o contigs database and one or more BAM files.
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-i INPUT_BAM(S) [INPUT_BAM(S) ...], --input-files INPUT_BAM(S) [INPUT_BAM(S) ...]
Sorted and indexed BAM files to analyze. It is
essential that all BAM files must be the result of
mappings against the same contigs.
GENES: Gene calls you want to work with
--gene-caller-id GENE_CALLER_ID
A single gene id.
--genes-of-interest FILE
A file with anvi'o gene caller IDs. There should be
only one column in the file, and each line should
correspond to a unique gene caller id (without a
column header).
--leeway LEEWAY_NTs The minimum number of nucleotides for a given short
read mapping into the gene context for it to be
reported. You must consider the length of your short
reads, as well as the length of the gene you are
targeting. The default is 100 nts.
OUTPUT: How should results be stored.
-O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
A prefix to be used while naming the output files (no
file type extensions please; just a prefix).
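A file for `--genes-of-interest` has a single column, no header, and one unique gene caller id per line. A minimal Python sketch that builds such content is shown below; the function name and the example ids are hypothetical.

```python
from typing import Iterable


def genes_of_interest_text(gene_caller_ids: Iterable) -> str:
    """Render gene caller ids as a one-column text file without a header."""
    ids = [str(gene_id).strip() for gene_id in gene_caller_ids]
    if any(not gene_id for gene_id in ids):
        raise ValueError("empty gene caller id")
    if len(set(ids)) != len(ids):
        raise ValueError("gene caller ids must be unique")
    return "\n".join(ids) + "\n"


# Hypothetical gene caller ids, for illustration only.
content = genes_of_interest_text([3, 17, 42])
```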
Export splits and the coverage table from database
Usage
anvi-get-split-coverages [-h] -p PROFILE_DB [--split-name SPLIT_NAME]
[-c CONTIGS_DB] [-C COLLECTION_NAME]
[-b BIN_NAME] [-o FILE_PATH] [--list-splits]
[--list-collections] [--list-bins]
Parameters
ESSENTIAL ANVI'O DB: You need to provide a profile database.
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
INPUT OPTION #1: SPLIT NAME: You want nothing but the coverage values in a single split. FINE.
--split-name SPLIT_NAME
Split name.
INPUT OPTION #2: COLLECTION + BIN: You want nucleotide-level coverage values for all splits in a bin. FANCY.
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.
BORING STUFF: The output file and all.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
--list-splits When declared, the program will list split names in
the profile database and quit.
--list-collections Show available collections and exit.
--list-bins List available bins in a collection and exit.
Search for anvi'o programs by keyword, inputs/outputs, etc.
Usage
anvi-help [-h] [--requires] [--provides] [--name] [--report REPORT]
search-term
Parameters
positional arguments:
search-term Find programs associated with this search term. If you
want all programs, use 'ALL'.
optional arguments:
--requires, -r Restrict to programs that require this search term
--provides, -p Restrict to programs that provide this search term
--name, -n Restrict to programs that contain this search term in
their name
--report REPORT, -R REPORT
Which information would you like to be in the report?
Mess with this if you are disappointed with the
default. Possible values are Description, Tags, Requires,
Provides, Status, and Resources. Add multiple of them
with commas (no whitespace). For example, if you
wanted Description and Resources, you would put here
Description,Resources
Import an external binning result into anvi'o
Usage
anvi-import-collection [-h] [-c CONTIGS_DB] [-p PAN_OR_PROFILE_DB] -C
COLLECTION_NAME [--bins-info BINS_INFO]
[--contigs-mode]
TAB DELIMITED FILE
Parameters
positional arguments:
TAB DELIMITED FILE The input file that describes bin IDs for each split
or contig.
optional arguments:
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database (and even genes
database in appropriate contexts).
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
--bins-info BINS_INFO
Additional information for bins. The file must contain
three TAB-delimited columns, where the first one must
be a unique bin name, the second should be a 'source',
and the last one should be a 7 character HTML color
code (i.e., '#424242'). Source column must contain
information about the origin of the bin. If these bins
are automatically identified by a program like
CONCOCT, this column could contain the program name
and version. The source information will be associated
with the bin in various interfaces so in a sense it is
not *that* critical what it says there, but on the
other hand it is, because we should also think about
people who may end up having to work with what we put
together later.
--contigs-mode Use this flag if your binning was done on contigs
instead of splits. Please refer to the documentation
for help.
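The `--bins-info` layout described above (unique bin name, source, and a 7-character HTML color code, TAB-delimited) can be checked with a small Python sketch. It assumes the file carries no header line, which the help text does not state; the bin names and source strings are hypothetical.

```python
import re
from typing import Iterable, Tuple

HTML_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")  # 7 characters, e.g. '#424242'


def bins_info_table(bins: Iterable[Tuple[str, str, str]]) -> str:
    """Render (bin name, source, color) rows as TAB-delimited text.

    Assumption: no header line is written.
    """
    seen = set()
    lines = []
    for bin_name, source, color in bins:
        if bin_name in seen:
            raise ValueError(f"duplicate bin name: {bin_name}")
        if not HTML_COLOR.fullmatch(color):
            raise ValueError(f"not a 7 character HTML color code: {color}")
        seen.add(bin_name)
        lines.append("\t".join((bin_name, source, color)))
    return "\n".join(lines) + "\n"


# Hypothetical bins and source label, for illustration only.
table = bins_info_table([
    ("Bin_1", "CONCOCT", "#424242"),
    ("Bin_2", "CONCOCT", "#c9d433"),
])
```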
Parse and store functional annotation of genes.
Example uses and other resources
Usage
anvi-import-functions [-h] -c CONTIGS_DB [-p PARSER] -i FILE(S)
[FILE(S) ...] [--drop-previous-annotations]
Parameters
optional arguments:
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-p PARSER, --parser PARSER
Parser to make sense of the input files (if you need
one). There is currently 1 parser readily available:
['interproscan']. IT IS OK if you do not select a
parser if you have a standard, TAB-delimited input
file for functional annotation of genes. If this is
not like 2018 and everything is already outdated, you
should be able to go to this address and learn
everything you need like a boss:
http://merenlab.org/2016/06/18/importing-functions/
-i FILE(S) [FILE(S) ...], --input-files FILE(S) [FILE(S) ...]
One or more input files should follow this parameter.
The way these files will be handled will depend on
which parser you selected (if you did select any).
--drop-previous-annotations
Use this flag if you want anvi'o to remove ALL
previous functional annotations for your genes, and
then import the new data. The default behavior will
add any annotation source into the db incrementally
unless there are already annotations from this source.
In which case, it will first remove previous
annotations for that source only (i.e., if source X is
both in the db and in the incoming annotations data,
it will replace the content of source X in the db).
Import a new items order into an anvi'o database
Usage
anvi-import-items-order [-h] [-i FILE] [-p DB PATH] [--name ORDER NAME]
[--make-default]
Parameters
CRITICAL INPUT: Basically the input file and the target database
-i FILE, --input-order FILE
One of the two important things you must provide: the
file that contains the items order. The format of this
file is important. It can either contain a proper
newick tree in it, or a complete list of 'items' in
the target database where every line of the file is
simply an item name. If you are providing a newick
tree, the entire file should be a single line. I know
it sounds hard, but you seriously can do this.
-p DB PATH, --db-path DB PATH
An appropriate anvi'o database to import the items
order. Currently it can be a profile, pan, or genes
database. But you should try your chances with other
kinds of databases for fun and games. Basically, if
the database contains an items order table, then
things will work. Otherwise, you will probably get
angry errors back in the worst case scenario.
NOT SO CRITICAL INPUT: Because not all parameters are created equal
--name ORDER NAME What should we call this order? Give it a concise,
single-word name.
--make-default You have the option to make this order the default
order in the database. Which means, anvi'o will use
this one when someone runs the program anvi-
interactive and presses draw. Big responsibility. But
if you have a 'default' state, it will not work
because the default items order in the state file
overwrites the one that comes from the database. So
not that big of a responsibility.
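The two accepted shapes of the `--input-order` file (a newick tree on a single line, or one item name per line) can be produced with a Python sketch like the one below. The helper names and the split names are hypothetical, and the newick helper assumes that item labels contain no whitespace.

```python
from typing import Iterable


def items_order_from_names(names: Iterable[str]) -> str:
    """One item name per line; the list should cover every item in the database."""
    names = list(names)
    if len(set(names)) != len(names):
        raise ValueError("item names must be unique")
    return "\n".join(names) + "\n"


def items_order_from_newick(newick: str) -> str:
    """Collapse a (possibly multi-line) newick string into a single line.

    Assumption: labels contain no whitespace, so all whitespace can be dropped.
    """
    single_line = "".join(newick.split())
    if not single_line.endswith(";"):
        raise ValueError("a newick tree should end with ';'")
    return single_line + "\n"


# Hypothetical split names, for illustration only.
as_list = items_order_from_names(["split_1", "split_2", "split_3"])
as_tree = items_order_from_newick(
    "((split_1:0.1,\n  split_2:0.2):0.3,\n split_3:0.4);"
)
```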
Populate additional data or order tables in pan or profile databases for items and layers, OR additional data in contigs databases for nucleotides and amino acids (the Swiss army knife-level serious stuff).
Example uses and other resources
Usage
anvi-import-misc-data [-h] [-p PAN_OR_PROFILE_DB] [-c CONTIGS_DB] -t
NAME [-D NAME] [--transpose] [--just-do-it]
TAB DELIMITED FILE
Parameters
positional arguments:
TAB DELIMITED FILE The input file that describes additional data for
layers or items. The expected format of this file
depends on the data table you will target. This can
feel complicated, but we promise it is not (you
probably have a PhD or are working on one, so trust us
when we say "it is not complicated"). You need to read
the online documentation if this is your first time
with this.
Database input: Provide 1 of these
-p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database (and even genes
database in appropriate contexts).
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
Details: Everything else.
-t NAME, --target-data-table NAME
The target table is the table you are interested in
accessing. Currently it can be 'items', 'layers', or
'layer_orders'. Please see most up-to-date online
documentation for more information.
-D NAME, --target-data-group NAME
Data group to focus. Anvi'o misc data tables support
associating a set of data keys with a data group. If
you have no idea what this is, then probably you don't
need it, and anvi'o will take care of you. Note: this
flag is IRRELEVANT if you are working with additional
order data tables.
--transpose Transpose the input matrix file before clustering.
--just-do-it Don't bother me with questions or warnings, just do
it.
Import an anvi'o state into a profile database.
Usage
anvi-import-state [-h] -p PAN_OR_PROFILE_DB -s STATE_FILE -n STATE_NAME
Parameters
optional arguments:
-p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database (and even genes
database in appropriate contexts).
-s STATE_FILE, --state STATE_FILE
JSON serializable anvi'o state file.
-n STATE_NAME, --name STATE_NAME
State name.
Import gene-level taxonomy into an anvi'o contigs database.
Usage
anvi-import-taxonomy-for-genes [-h] -c CONTIGS_DB [-p PARSER] -i FILE(S)
[FILE(S) ...] [--just-do-it]
Parameters
optional arguments:
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-p PARSER, --parser PARSER
Parser to make sense of the input files. There are 3
parsers readily available: ['default_matrix',
'centrifuge', 'kaiju']. It is OK if you do not select
a parser, but in that case there will be no additional
contigs available except the identification of single-
copy genes in your contigs for later use. Using a
parser will not prevent the analysis of single-copy
genes, but it will make anvi'o more powerful to help you make
sense of your results. Please see the documentation,
or get in touch with the developers if you have any
questions regarding parsers.
-i FILE(S) [FILE(S) ...], --input-files FILE(S) [FILE(S) ...]
Input file(s) for selected parser. Each parser (except
"blank") requires input files to process that you
generate before running anvi'o. Please see the
documentation for details.
--just-do-it Don't bother me with questions or warnings, just do
it.
Import layers-level taxonomy into an anvi'o additional layer data table in an anvi'o single-profile database.
Usage
anvi-import-taxonomy-for-layers [-h] -p PROFILE_DB [--parser PARSER] -i
FILE(S) [FILE(S) ...]
[--min-abundance PERCENTAGE]
Parameters
optional arguments:
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
--parser PARSER Parser to make sense of the input files. There is 1
parser readily available: ['krakenuniq'].
-i FILE(S) [FILE(S) ...], --input-files FILE(S) [FILE(S) ...]
Input file(s) for selected parser. Each parser (except
"blank") requires input files to process that you
generate before running anvi'o. Please see the
documentation for details.
--min-abundance PERCENTAGE
Short read-based taxonomy can be extremely noisy.
Therefore, we apply a default minimum percentage
cutoff of 0.1% to eliminate any taxon that occurs less
than that in a given input file.
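As a rough illustration of the --min-abundance cutoff described above, the sketch below drops taxa whose share of one input file falls below a percentage threshold. The function name and the input dictionary are hypothetical, this is not anvi'o's own code, and keeping a taxon that sits exactly at the cutoff is an assumption.

```python
def filter_by_min_abundance(taxon_counts, min_abundance=0.1):
    """Keep taxa whose percent abundance is at least `min_abundance`.

    `taxon_counts` maps taxon name -> read count for ONE input file
    (a hypothetical structure chosen for illustration).
    """
    total = sum(taxon_counts.values())
    if total == 0:
        return {}
    kept = {}
    for taxon, count in taxon_counts.items():
        percent = 100.0 * count / total
        if percent >= min_abundance:  # ">=" is an assumption
            kept[taxon] = percent
    return kept


counts = {"Bacteroides": 9000, "Prevotella": 995, "Noise_taxon": 5}
print(sorted(filter_by_min_abundance(counts)))
```

Here "Noise_taxon" makes up 0.05% of the reads, so it is removed with the default cutoff of 0.1%.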
Sort/Index BAM files
Usage
anvi-init-bam [-h] [-o FILE_PATH] [-T NUM_THREADS] BAM_FILE
Parameters
positional arguments:
BAM_FILE BAM file to analyze
optional arguments:
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on a SGE
--if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.
Start an anvi'o inspect interactive interface.
Usage
anvi-inspect [-h] [-p PROFILE_DB] [-c CONTIGS_DB]
[--split-name SPLIT_NAME] [--hide-outlier-SNVs]
[-I IP_ADDR] [-P INT] [--server-only] [--just-do-it]
Parameters
DEFAULT INPUTS: The interactive interface can be started with anvi'o databases.
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
--split-name SPLIT_NAME
Split name.
VISUALS RELATED: Parameters that give access to various adjustments regarding the interface.
--hide-outlier-SNVs During profiling, anvi'o marks positions of single-
nucleotide variations (SNVs) that originate from
places in contigs where coverage values are a bit
'sketchy'. If you would like to avoid SNVs in those
positions of splits in applicable projects you can use
this flag, and the interface would hide SNVs that are
marked as 'outlier' (although it is clearly the best
to see everything, no one will judge you if you end up
using this flag) (plus, there may or may not be some
historical data on this here:
https://github.com/meren/anvio/issues/309).
SERVER CONFIGURATION: For power users.
-I IP_ADDR, --ip-address IP_ADDR
IP address for the HTTP server. The default ip address
(0.0.0.0) should work just fine for most.
-P INT, --port-number INT
Port number to use for anvi'o services. If nothing is
declared, anvi'o will try to find a suitable port
number, starting from the default port number, 8080.
--server-only The default behavior is to start the local server, and
fire up a browser that connects to the server. If you
have other plans, and want to start the server without
calling the browser, this is the flag you need.
GENERAL CONVENIENCE: From anvi'o developers to you.
--just-do-it Don't bother me with questions or warnings, just do
it.
Start an anvi'o server for the interactive interface
Example uses and other resources
Usage
anvi-interactive [-h] [-p PROFILE_DB] [-c CONTIGS_DB]
[-C COLLECTION_NAME] [--manual-mode] [-f FASTA]
[-d VIEW_DATA] [-t NEWICK] [--items-order FLAT_FILE]
[-V ADDITIONAL_VIEW] [-A ADDITIONAL_LAYERS]
[--gene-mode] [--inseq-stats] [-b BIN_NAME]
[--view NAME] [--title NAME]
[--taxonomic-level {t_domain,t_phylum,t_class,t_order,t_family,t_genus,t_species}]
[--split-hmm-layers] [--hide-outlier-SNVs]
[--state-autoload NAME] [--collection-autoload NAME]
[--export-svg FILE_PATH] [--show-views]
[--skip-check-names] [-o DIR_PATH] [--dry-run]
[--show-states] [--list-collections]
[--skip-init-functions] [--skip-auto-ordering]
[--distance DISTANCE_METRIC]
[--linkage LINKAGE_METHOD] [-I IP_ADDR] [-P INT]
[--browser-path PATH] [--read-only] [--server-only]
[--password-protected] [--user-server-shutdown]
Parameters
DEFAULT INPUTS: The interactive interface can be started with and without anvi'o databases. The default use assumes you have your profile and contigs database, however, it is also possible to start the interface using ad hoc input files. See 'MANUAL INPUT' section for required parameters.
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
If you have a collection in your profile database, you
can use this flag to start the interactive interface
with a tree showing your bins in your collection,
instead of each split. This is very useful when you
have imported your external binning results into
anvi'o, and want to see the distribution of your bins
across samples. In these cases anvi'o will cluster
your bins based on multiple metrics. Because this
particular clustering is done on the fly within the
anvi'o interactive class, you get to define a
distance metric and a linkage method using the --distance
and --linkage parameters if you want!
MANUAL INPUTS: Mandatory input parameters to start the interactive interface without anvi'o databases.
--manual-mode Using this flag, you can run the interactive interface
in an ad hoc manner using input files you curated
instead of standard output files generated by an
anvi'o run. In the manual mode you will be asked to
provide a profile database. In this mode a profile
database is only used to store 'state' of the
interactive interface so you can reload your visual
settings when you re-analyze the same files again. If
the profile database you provide does not exist,
anvi'o will create an empty one for you.
-f FASTA, --fasta-file FASTA
A FASTA-formatted input file
-d VIEW_DATA, --view-data VIEW_DATA
A TAB-delimited file for view data
-t NEWICK, --tree NEWICK
NEWICK formatted tree structure
--items-order FLAT_FILE
A flat file that contains the order of items you wish
to display using the interactive interface. You may
want to use this if you have a specific order of items
in your mind, and do not want to display a tree in the
middle (or simply you don't have one). The file format
is simple: each line should have an item name, and
there should be no header.
ADDITIONAL STUFF: Parameters to provide additional layers, views, or layer data.
-V ADDITIONAL_VIEW, --additional-view ADDITIONAL_VIEW
A TAB-delimited file for an additional view to be used
in the interface. This file should contain all split
names, and values for each of them in all samples.
Each column in this file must correspond to a sample
name. Content of this file will be called 'user_view',
which will be available as a new item in the 'views'
combo box in the interface
-A ADDITIONAL_LAYERS, --additional-layers ADDITIONAL_LAYERS
A TAB-delimited file for additional layers for splits.
The first column of this file must be split names, and
the remaining columns should be unique attributes. The
file does not need to contain all split names, or
values for each split in every column. Anvi'o will try
to deal with missing data nicely. Each column in this
file will be visualized as a new layer in the tree.
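To make the --additional-layers file format above concrete, here is a minimal sketch that writes such a TAB-delimited table: split names in the first column, one attribute per remaining column, and empty fields where a value is missing. The function name, the "split_name" header, and the example split names are all hypothetical.

```python
import csv
import io


def write_additional_layers(rows, columns, handle):
    """Write a TAB-delimited additional-layers table.

    `rows` maps split name -> {column name: value}; missing values are
    written as empty fields (how anvi'o fills them in is not shown here).
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["split_name"] + columns)  # header name is an assumption
    for split_name, values in rows.items():
        writer.writerow([split_name] + [values.get(col, "") for col in columns])


rows = {
    "contig_1_split_00001": {"gc_class": "high", "my_score": 0.82},
    "contig_2_split_00001": {"gc_class": "low"},  # no value for my_score
}
buffer = io.StringIO()
write_additional_layers(rows, ["gc_class", "my_score"], buffer)
print(buffer.getvalue())
```

Writing to a real file instead of the in-memory buffer only requires passing an open file handle.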
GENE MODE: Gene mode related parameters.
--gene-mode Initiate the interactive interface in 'gene mode'. In
this mode, the items are genes (instead of splits of
contigs). The following views are available: detection
(the detection value of each gene in each sample). The
mean_coverage (the mean coverage of genes). The
non_outlier_mean_coverage (the mean coverage of the
non-outlier nucleotide positions of each gene in each
sample (median absolute deviation is used to remove
outliers per gene per sample)). The
non_outlier_coverage_std view (standard deviation of
the coverage of non-outlier positions of genes in
samples). You can also choose to order items and
layers according to each one of the aforementioned
views. In addition, all layer orderings that are
available in the regular mode (i.e. the full mode
where you have contigs/splits) are also available in
'gene mode', so that, for example, you can choose to
order the layers according to 'detection', and that
would be the order according to the detection values
of splits, whereas if you choose 'genes_detections'
then the order of layers would be according to the
detection values of genes. Inspection and sequence
functionality are available (through the right-click
menu), except now sequences are of the specific gene.
Inspection has now two options available: 'Inspect
Context', which brings you to the inspection page of
the split to which the gene belongs where the
inspected gene will be highlighted in yellow in the
bottom, and 'Inspect Gene', which opens the inspection
page only for the gene and 100 nts around each side of
it (the purpose of this option is to make the
inspection page load faster if you only want to look
at the nucleotide coverage of a specific gene).
NOTICE: You can't store states or collections in 'gene
mode'. However, you still can make fake selections,
and create fake bins for your viewing convenience only
(smiley). Search options are available, and you can
even search for functions if you have them in your
contigs database. ANOTHER NOTICE: loading this mode
might take a while if your bin has many genes and
your profile database has many samples, because
the gene coverage stats are computed in an
ad hoc manner when you load this mode. We know this is
not ideal and we plan to improve that (along with
other things). If you have suggestions/complaints
regarding this mode please comment on this github
issue: https://goo.gl/yHhRei. Please refer to the
online tutorial for more information.
--inseq-stats Provide if working with INSeq/Tn-Seq genomic data.
With this, all gene level coverage stats will be
calculated using INSeq/Tn-Seq statistical methods.
-b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.
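The non_outlier_mean_coverage view described under --gene-mode above relies on median absolute deviation (MAD) to discard outlier positions. The sketch below shows one common way to do that. The exact rule and the threshold of 1.5 are assumptions for illustration and may differ from what anvi'o does internally.

```python
from statistics import mean, median


def non_outlier_mean_coverage(coverages, threshold=1.5):
    """Mean of per-nucleotide coverages after dropping MAD-based outliers.

    A position is treated as an outlier when its absolute deviation from
    the median exceeds `threshold` times the median absolute deviation.
    Both the rule and the threshold are assumptions for illustration.
    """
    med = median(coverages)
    mad = median(abs(c - med) for c in coverages)
    if mad == 0:
        # At least half of the positions equal the median; keep only those.
        kept = [c for c in coverages if c == med]
    else:
        kept = [c for c in coverages if abs(c - med) <= threshold * mad]
    return mean(kept)


# One position with a coverage spike (95) among otherwise even coverage.
print(non_outlier_mean_coverage([10, 11, 9, 10, 12, 10, 95]))
```

The spike at 95 would pull a plain mean far above the typical coverage, which is the kind of distortion this view is meant to avoid.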
VISUALS RELATED: Parameters that give access to various adjustments regarding the interface.
--view NAME Start the interface with a pre-selected view. To see a
list of available views, use --show-views flag.
--title NAME Title for the interface. If you are working with a
RUNINFO dict, the title will be determined based on
information stored in that file. Regardless, you can
override that value using this parameter.
--taxonomic-level {t_domain,t_phylum,t_class,t_order,t_family,t_genus,t_species}
The taxonomic level to use whenever relevant and/or
available. The default taxonomic level is t_genus, but
if you choose something specific, anvi'o will focus on
that whenever possible.
--split-hmm-layers When declared, this flag tells the interface to split
every gene found in HMM searches that were performed
against non-singlecopy gene HMM profiles into their
own layer. Please see the documentation for details.
--hide-outlier-SNVs During profiling, anvi'o marks positions of single-
nucleotide variations (SNVs) that originate from
places in contigs where coverage values are a bit
'sketchy'. If you would like to avoid SNVs in those
positions of splits in applicable projects you can use
this flag, and the interface would hide SNVs that are
marked as 'outlier' (although it is clearly the best
to see everything, no one will judge you if you end up
using this flag) (plus, there may or may not be some
historical data on this here:
https://github.com/meren/anvio/issues/309).
--state-autoload NAME
Automatically load previously saved state and draw tree.
To see a list of available states, use --show-states
flag.
--collection-autoload NAME
Automatically load a collection and draw tree. To see
a list of available collections, use --list-
collections flag.
--export-svg FILE_PATH
The SVG output file path.
SWEET PARAMS OF CONVENIENCE: Parameters and flags that are not quite essential (but nice to have).
--show-views When declared, the program will show a list of
available views, and exit.
--skip-check-names For debugging purposes. You should never really need
it.
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files
--dry-run Don't do anything real. Test everything, and stop
right before wherever the developer said 'well, this
is enough testing', and decided to print out results.
--show-states When declared the program will print all available
states and exit.
--list-collections Show available collections and exit.
--skip-init-functions
When declared, function calls for genes will not be
initialized (therefore will be missing from all
relevant interfaces or output files). The use of this
flag may reduce the memory footprint and processing
time for large datasets.
--skip-auto-ordering When declared, the attempt to include automatically
generated orders of items based on additional data is
skipped. Use this in case those buggers cause issues with
your data and you still want to see your stuff and deal
with the other issue later.
--distance DISTANCE_METRIC
The distance metric for the hierarchical clustering.
Only relevant if you are running the interactive
interface in "collection" mode. The default is
"euclidean".
--linkage LINKAGE_METHOD
The linkage method for the hierarchical clustering.
Only relevant if you are running the interactive
interface in "collection" mode. The default is "ward".
SERVER CONFIGURATION: For power users.
-I IP_ADDR, --ip-address IP_ADDR
IP address for the HTTP server. The default ip address
(0.0.0.0) should work just fine for most.
-P INT, --port-number INT
Port number to use for anvi'o services. If nothing is
declared, anvi'o will try to find a suitable port
number, starting from the default port number, 8080.
--browser-path PATH By default, anvi'o will use your default browser to
launch the interactive interface. If you would like to
use something else than your system default, you can
provide a full path for an alternative browser using
this parameter, and hope for the best. For instance we
are using this parameter to call Google's experimental
browser, Canary, which performs better with demanding
visualizations.
--read-only When the interactive interface is started with this
flag, all 'database write' operations will be
disabled.
--server-only The default behavior is to start the local server, and
fire up a browser that connects to the server. If you
have other plans, and want to start the server without
calling the browser, this is the flag you need.
--password-protected If this flag is set, the command line tool will ask you to
enter a password, and the interactive interface will
only be accessible after entering the same password. This
option is recommended for shared machines like
clusters or shared networks where computers are not
isolated.
--user-server-shutdown
Allow users to shut down an anvi'server via web
interface.
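The --port-number help above says anvi'o looks for a suitable port starting from 8080. A minimal sketch of that kind of search, using only the standard library, could look like the following. The function name, the number of attempts, and probing on 127.0.0.1 are assumptions and not anvi'o's actual implementation.

```python
import socket


def find_available_port(start=8080, attempts=100, host="127.0.0.1"):
    """Return the first port >= `start` that can be bound on `host`."""
    for port in range(start, start + attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port is taken (or not permitted); try the next one
            return port
    raise RuntimeError(f"no free port in {start}..{start + attempts - 1}")


print(find_available_port())
```

Note that the probe socket is closed before the port is returned, so another process could still grab the port before a server starts on it.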
Takes a distance matrix, returns a newick tree.
Usage
anvi-matrix-to-newick [-h] [-o FILE_PATH]
[--items-order-file FILE PATH] [--transpose]
[--distance DISTANCE_METRIC]
[--linkage LINKAGE_METHOD]
PATH
Parameters
INPUT: The data you wish to cluster
PATH Input matrix
OUTPUT: How would you like your results to be reported?
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
--items-order-file FILE PATH
In addition to a newick formatted output file, you can
ask anvi'o to report the order of items in the
resulting tree in a separate file. The content of this
file will be a single column of item names in the order they
appear in the output newick dendrogram.
SWEETS: Additional options
--transpose Transpose the input matrix file before clustering.
--distance DISTANCE_METRIC
The distance metric for the hierarchical clustering.
The default distance metric is 'euclidean'. You can
find the full list of distance metrics either by
making a mistake (such as entering a non-existent
distance metric and making anvi'o upset), or by taking
a look at the documentation of the
pdist function in the scipy.spatial.distance
module.
--linkage LINKAGE_METHOD
The linkage method for the hierarchical clustering.
The default linkage method is 'ward', because that is
the best one. It really is. We talked to a lot of
people and they were all like 'this is the best one
available' and it is just all out there. Honestly it
is so good that we will build a wall around it and
make other linkage methods pay for it. But if you want
to see a full list of available ones you can check the
linkage function in the scipy.cluster.hierarchy module.
Up to you really. But then you can't use ward anymore,
and you would have to leave anvi'o right now.
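The --items-order-file output described above is a single column of item names in dendrogram order. As a small sketch of what that means, the helper below pulls leaf names out of a Newick string in the order they appear. The function and the example tree are hypothetical, and quoted names that contain special characters are not handled.

```python
import re

# A leaf name directly follows "(" or ","; internal labels follow ")".
LEAF_PATTERN = re.compile(r"(?<=[(,])\s*([^\s(),:;]+)")


def leaf_order(newick):
    """Return leaf names in the order they appear in a Newick string.

    Simplification: quoted names with special characters are not handled.
    """
    return LEAF_PATTERN.findall(newick)


tree = "((split_a:0.1,split_b:0.2):0.3,(split_c:0.4,split_d:0.5):0.6);"
print("\n".join(leaf_order(tree)))
```

Each printed line corresponds to one row of an items order file for this tree.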
A program to classify genes according to coverage across multiple metagenomes
Usage
anvi-mcg-classifier [-h] -p PROFILE_DB -c CONTIGS_DB
[-O FILENAME_PREFIX] [-C COLLECTION_NAME]
[-b BIN_NAME] [-B FILE_PATH]
[--exclude-samples FILE] [--include-samples FILE]
[--gen-figures] [--get-samples-stats-only] [-W]
[--alpha NUM] [--outliers-threshold NUM]
[--zeros-are-outliers]
Parameters
ESSENTIAL INPUTS: You must supply a merged profile db (along with a matching contigs db)
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
ESSENTIAL OUTPUTS: The outputs of the algorithm are: an anvi'o additional layers format file with the classification information for genes, and an anvi'o samples information file with detection information per sample. In addition, when a profile database is given, gene-coverages and gene-detection tables are also saved. All files are created with the prefix that is provided by the user.
-O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
A prefix to be used while naming the output files (no
file type extensions please; just a prefix).
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
ADDITIONAL STUFF: Parameters to provide pre-existing additional layers, samples-information files, so that the outputs would be added to these files
-b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.
-B FILE_PATH, --bin-ids-file FILE_PATH
Text file for bins (each line should be a unique bin
id).
--exclude-samples FILE
List of samples to exclude for the analysis.
--include-samples FILE
List of samples to include for the analysis.
--gen-figures For those of you who wish to dig deeper, a collection
of figures could be created to allow you to get
insight into how the classification was generated.
This is especially useful to identify cases in which
you shouldn't trust the classification (for example
due to a large number of outliers). NOTICE: if you ask
anvi'o to generate these figures then it will
significantly extend the execution time. To learn
about which figures are created and what they mean,
contact your nearest anvi'o developer, because
currently it is a well-hidden secret.
--get-samples-stats-only
If you only wish to get statistics regarding the
occurrence of bins in samples, then use this flag.
Especially when dealing with many samples or large
genomes, gene stats could take a long time to compute.
By using this flag you could save a lot of computation
time.
-W, --overwrite-output-destinations
Overwrite if the output files and/or directories
exist.
PARAMETERS: Parameters to determine cut-offs for the gene-classifier
--alpha NUM, --genome-detection-uncertainty NUM
Determines the range of sample detection values that
are considered negative, ambiguous or positive. Min of
0 and smaller than 0.5, default of 0.25. For example,
with the default, samples with detection below 0.5 - 0.25
= 0.25 will be considered negative (i.e. do not contain
the genome), samples with detection between 0.25 and
0.75 would be ambiguous (and hence would not be used
for the classification), and samples with detection
above 0.75 would be considered positive (i.e. contain
the genome).
--outliers-threshold NUM
Threshold to use for the outlier detection. The
default value is '1.5'. Absolute deviation around the
median is used. To read more about the method please
refer to: 'How to Detect and Handle Outliers' by Boris
Iglewicz and David Hoaglin
(doi:10.1016/j.jesp.2013.03.013).
--zeros-are-outliers If you want all zero coverage positions to be treated
like outliers then use this flag. The reason to treat
zero coverage as outliers is because when mapping
reads to a reference we could get many zero positions
due to accessory genes. These positions then skew the
average values that we compute.
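The --alpha description above defines three detection ranges around 0.5. A minimal sketch of that rule follows. The function name is hypothetical, and treating values that fall exactly on a boundary as ambiguous is an assumption.

```python
def classify_sample(detection, alpha=0.25):
    """Label a sample as 'negative', 'ambiguous', or 'positive'.

    Thresholds follow the help text: below 0.5 - alpha is negative,
    above 0.5 + alpha is positive, anything in between is ambiguous.
    """
    if not 0 <= alpha < 0.5:
        raise ValueError("alpha must be >= 0 and smaller than 0.5")
    if detection < 0.5 - alpha:
        return "negative"
    if detection > 0.5 + alpha:
        return "positive"
    return "ambiguous"  # boundary values land here (an assumption)


for value in (0.10, 0.50, 0.90):
    print(value, classify_sample(value))
```

With the default alpha of 0.25, only samples labeled negative or positive would be used for the classification; ambiguous ones are set aside.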
Merge multiple anvi'o profiles
Usage
anvi-merge [-h] -c CONTIGS_DB [-o DIR_PATH] [-S NAME]
[--description TEXT_FILE] [--skip-hierarchical-clustering]
[--enforce-hierarchical-clustering]
[--distance DISTANCE_METRIC] [--linkage LINKAGE_METHOD] [-W]
SINGLE_PROFILE(S) [SINGLE_PROFILE(S) ...]
Parameters
positional arguments:
SINGLE_PROFILE(S) Anvi'o single profiles to merge
optional arguments:
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files
-S NAME, --sample-name NAME
It is important to set a sample name (using only ASCII
letters and digits and without spaces) that is unique
(considering all others). If you do not provide one,
anvi'o will try to make one up for you based on other
information (although you should never let the
software decide these things).
--description TEXT_FILE
A plain text file that contains some description about
the project. You can use Markdown syntax. The
description text will be rendered and shown in all
relevant interfaces, including the anvi'o interactive
interface, or anvi'o summary outputs.
--skip-hierarchical-clustering
If you are not planning to use the interactive
interface (or if you have other means to add a tree of
contigs in the database) you may skip the step where
hierarchical clustering of your items is performed
based on default clustering recipes matching to your
database type.
--enforce-hierarchical-clustering
If you have more than 25,000 splits in your merged
profile, anvi-merge will automatically skip the
hierarchical clustering of splits (by setting --skip-
hierarchical-clustering flag on). This is due to the
fact that the computational time required for hierarchical
clustering grows much faster than linearly with the number of
items being clustered. Based on our experience we
decided that 25,000 splits is about the maximum we
should try. However, this is not a theoretical limit,
and you can override this heuristic by using this
flag, which would tell anvi'o to attempt to cluster
splits regardless.
--distance DISTANCE_METRIC
The distance metric for the hierarchical clustering.
If you do not use this flag, the default distance
metric will be used for each clustering configuration
which is "euclidean".
--linkage LINKAGE_METHOD
The same story with the `--distance`, except, the
system default for this one is ward.
-W, --overwrite-output-destinations
Overwrite if the output files and/or directories
exist.
Merge a given set of bins in an anvi'o collection
Usage
anvi-merge-bins [-h] -p PAN_OR_PROFILE_DB [-C COLLECTION_NAME]
[-b BIN NAMES] [-B BIN NAME] [--list-collections]
[--list-bins]
Parameters
DB AND COLLECTION: Simple enough. This guy needs a pan or profile database and a collection name. You can get a list of available collections with another flag down below.
-p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database (and even genes
database in appropriate contexts).
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
BINS TO WORK WITH: Here you need to define a list of bin names to merge, and the new bin name for them to merge under. Your bin names should be comma-separated. Both 'name_1, name_2, name_3' and name_1,name_2,name_3 will work. Your new bin name better be a single word, meaningful name so anvi'o does not complain about it later.
-b BIN NAMES, --bin-names-list BIN NAMES
Comma-separated list of bin names.
-B BIN NAME, --new-bin-name BIN NAME
The new bin name.
SWEET FLAGS OF CONVENIENCE: We gotchu.
--list-collections Show available collections and exit.
--list-bins List available bins in a collection and exit.
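The note above says both 'name_1, name_2, name_3' and name_1,name_2,name_3 are accepted for --bin-names-list. A short sketch of parsing that is tolerant of spaces is shown below. The function name is hypothetical and this is not taken from anvi'o's source.

```python
def parse_bin_names(bin_names_list):
    """Split a comma-separated string into clean bin names."""
    names = [name.strip() for name in bin_names_list.split(",")]
    return [name for name in names if name]  # drop empty entries


print(parse_bin_names("name_1, name_2, name_3"))
print(parse_bin_names("name_1,name_2,name_3"))
```

Both calls produce the same list, which is why either spelling works on the command line as long as the first form is quoted.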
Convert a pangenome into a metapangenome.
Usage
anvi-meta-pan-genome [-h] -p PAN_DB [-g GENOMES_STORAGE] [-i FILE]
[--fraction-of-median-coverage FLOAT]
[--min-detection FLOAT]
Parameters
PANGENOME: Files for the pangenome.
-p PAN_DB, --pan-db PAN_DB
Anvi'o pan database
-g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
Anvi'o genomes storage file
METAGENOME: Genome bins stored in anvi'o profile databases as collections.
-i FILE, --internal-genomes FILE
A four-column TAB-delimited flat text file. This file
should be identical to the internal genomes file you
used for your pangenomics analysis. Anvi'o will use
this file to find all profile and contigs databases
that contain the information for each gene and genome
across metagenomes.
CRITERION FOR DETECTION: This is tricky. What we want to do is to identify genes that are occurring uniformly across samples.
--fraction-of-median-coverage FLOAT
The value set here will be used to remove a gene if
its total coverage across environments is less than
the median coverage of all genes multiplied by this
value. The default is 0.25, which means, if the median
total coverage of all genes across all samples is
100X, then, a gene with a total coverage of less than
25X across all samples will be assumed not a part of
the 'environmental core'.
--min-detection FLOAT
For this entire thing to work, the genome you are
focusing on should be detected in at least one
metagenome. If that is not the case, it would mean
that you do not have any sample that represents the
niche for this organism (or you do not have enough
depth of coverage) to investigate the detection of
genes in the environment. By default, this script
requires at least '0.5' of the genome to be detected
in at least one metagenome. This parameter allows you
to change that. 0 would mean no detection test
required, 1 would mean the entire genome must be
detected.
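To illustrate the --fraction-of-median-coverage criterion above, the sketch below computes the cutoff as the median total coverage multiplied by the fraction, and keeps genes at or above it. The function name and the input dictionary are hypothetical, and the --min-detection test is not included.

```python
from statistics import median


def environmental_core(total_coverages, fraction_of_median_coverage=0.25):
    """Return gene ids whose total coverage passes the cutoff.

    `total_coverages` maps gene id -> total coverage across all samples
    (a hypothetical structure for illustration).
    """
    cutoff = median(total_coverages.values()) * fraction_of_median_coverage
    return {gene for gene, cov in total_coverages.items() if cov >= cutoff}


coverages = {
    "gene_1": 150.0,
    "gene_2": 120.0,
    "gene_3": 100.0,
    "gene_4": 60.0,
    "gene_5": 20.0,
}
print(sorted(environmental_core(coverages)))
```

The median here is 100X, so the cutoff is 25X, which mirrors the example in the help text: gene_5 at 20X is left out of the environmental core.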
Migrate an anvi'o database or config file to a newer version.
Usage
anvi-migrate [-h] [--migrate-dbs-safely] [--migrate-dbs-quickly]
[--just-do-it] [-t VERSION]
DATABASE(S) [DATABASE(S) ...]
Parameters
INPUTS: You will literally give us any anvi'o database.
DATABASE(S) Anvi'o database or config file for migration. You can
give many of them all at once. Running `anvi-migrate
*.db` in a directory will migrate all databases in
that directory.
SAFETY: It is up to you. Safe things take much longer and are boring. Unsafe things are fast, fun, and .. well, don't come to us if your computer loses power or something.
--migrate-dbs-safely If you choose this, anvi'o will first create a copy of
your original database. If something goes wrong, it
will restore the original. If everything works, it
will remove the old copy. IF YOU HAVE DATABASES THAT
ARE VERY LARGE OR IF YOU ARE MIGRATING MANY MANY OF
THEM THIS OPTION WILL ADD A HUGE I/O BURDEN ON YOUR
SYSTEM. But still. Safety is safe.
--migrate-dbs-quickly
If you choose this, anvi'o will migrate your databases
in place. It will be much faster (and arguably more
fun) than the safe option, but if something goes
wrong, you will lose data. During the first five years
of anvi'o development not a single user lost data
using our migration scripts as far as we know. But
there is always a first, and today might be your lucky
day.
PARAMETERS OF CONVENIENCE: This is how anvi'o spoils you.
--just-do-it Don't bother me with questions or warnings, just do
it.
-t VERSION, --target-version VERSION
Anvi'o will stop upgrading your database when it
reaches this version.
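The copy, migrate, then restore-or-clean-up behavior described for --migrate-dbs-safely above can be sketched in a few lines. This is only an illustration of the pattern: `migrate_safely`, the `.backup` suffix, and the `migrate` callable are hypothetical and not anvi'o's own code.

```python
import os
import shutil
import tempfile


def migrate_safely(db_path, migrate):
    """Run `migrate(db_path)` with a backup that is restored on failure.

    `migrate` is a hypothetical callable that upgrades the file in place
    and raises an exception if something goes wrong.
    """
    backup_path = db_path + ".backup"
    shutil.copy2(db_path, backup_path)  # the extra I/O the help text warns about
    try:
        migrate(db_path)
    except Exception:
        os.replace(backup_path, db_path)  # put the original back
        raise
    os.remove(backup_path)  # success: the old copy is no longer needed


with tempfile.TemporaryDirectory() as tmp:
    db = os.path.join(tmp, "example.db")
    with open(db, "w") as handle:
        handle.write("version 1")

    def broken_migration(path):
        with open(path, "w") as out:
            out.write("half-written")
        raise RuntimeError("simulated failure")

    try:
        migrate_safely(db, broken_migration)
    except RuntimeError:
        pass
    with open(db) as handle:
        print(handle.read())  # the original contents are back
```

The in-place alternative skips the copy entirely, which is why it is faster and also why a failure halfway through can leave a damaged file.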
Takes an anvi'o linkmers report, generates an oligotyping output
Usage
anvi-oligotype-linkmers [-h] -i LINKMER_REPORT -o DIR_PATH
Parameters
optional arguments:
-i LINKMER_REPORT, --input-file LINKMER_REPORT
Output file of `anvi-report-linkmers`.
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files
An anvi'o program to compute a pangenome from an anvi'o genome storage.
Example uses and other resources
Usage
anvi-pan-genome [-h] -g GENOMES_STORAGE [-G GENOME_NAMES]
[--skip-alignments] [--skip-homogeneity]
[--quick-homogeneity] [--align-with ALIGNER]
[--exclude-partial-gene-calls] [--use-ncbi-blast]
[--minbit MINBIT] [--mcl-inflation INFLATION]
[--min-occurrence NUM_OCCURRENCE]
[--min-percent-identity PERCENT] [--sensitive]
[-n PROJECT_NAME] [--description TEXT_FILE]
[-o PAN_DB_DIR] [-W] [-T NUM_THREADS]
[--skip-hierarchical-clustering]
[--enforce-hierarchical-clustering]
[--distance DISTANCE_METRIC] [--linkage LINKAGE_METHOD]
Parameters
GENOMES: The very fancy genomes storage file. This file is generated by the program anvi-gen-genomes-storage. Please see the online tutorial on pangenomic workflow if you don't know how to generate one.
-g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
Anvi'o genomes storage file
-G GENOME_NAMES, --genome-names GENOME_NAMES
Genome names to 'focus'. You can use this parameter to
limit the genomes included in your analysis. You can
provide these names as a comma-separated list of
names, or you can put them in a file, where you have a
single genome name in each line, and provide the file
path.
PARAMETERS: Important stuff Tom never pays attention to (but you should).
--skip-alignments By default, anvi'o attempts to align amino acid
sequences in each gene cluster using multiple sequence
alignment via muscle. You can use this flag to skip
that step and be upset later.
--skip-homogeneity By default, anvi'o attempts to calculate homogeneity
values for every gene cluster, given that they are
aligned. You can use this flag to have anvi'o skip
homogeneity calculations. Anvi'o will ignore this flag
if you decide to skip alignments
--quick-homogeneity By default, anvi'o will use a homogeneity algorithm
that checks for horizontal and vertical geometric
homogeneity (along with functional). With this flag,
you can tell anvi'o to skip horizontal geometric
homogeneity calculations. It will be less accurate but
quicker. Anvi'o will ignore this flag if you skip
homogeneity calculations or alignments all together.
--align-with ALIGNER The multiple sequence alignment program to use when
multiple sequence alignment is necessary. To see all
available options, use the flag `--list-aligners`.
--exclude-partial-gene-calls
By default, anvi'o includes all partial gene calls
in the analysis, which, in some cases, may inflate
the number of gene clusters identified and introduce
extra heterogeneity within those gene clusters. Using
this flag, you can request anvi'o to exclude partial
gene calls from the analysis (whether a gene call is
partial or not is an information that comes directly
from the gene caller used to identify genes during the
generation of the contigs database).
--use-ncbi-blast This program uses DIAMOND by default, however, if you
like, you can use good ol' blastp from NCBI instead.
--minbit MINBIT The minimum minbit value. The minbit heuristic
provides a means to eliminate weak matches
between two amino acid sequences. We learned it from
ITEP (Benedict MN et al, doi:10.1186/1471-2164-15-8),
which is a comprehensive analysis workflow for
pangenomes, and decided to use it in the anvi'o
pangenomic workflow, as well. Briefly, if you have two
amino acid sequences, 'A' and 'B', the minbit is
defined as 'BITSCORE(A, B) / MIN(BITSCORE(A, A),
BITSCORE(B, B))'. So the minbit score between two
sequences goes to 1 if they are very similar over the
entire length of the 'shorter' amino acid sequence,
and goes to 0 if (1) they match over a very short
stretch compared even to the length of the shorter
amino acid sequence or (2) the sequence identity of
the match is low. The default is 0.5.
--mcl-inflation INFLATION
MCL inflation parameter, that defines the sensitivity
of the algorithm during the identification of the gene
clusters. More information on this parameter and its
effect on cluster granularity is here:
(http://micans.org/mcl/man/mclfaq.html#faq7.2). The
default is 2.
--min-occurrence NUM_OCCURRENCE
Do you not want singletons? You don't? Well, this
parameter will help you get rid of them (along with
doubletons, if you want). Anvi'o will remove gene
clusters that occur less than the number you set using
this parameter from the analysis. The default is 1,
which means everything will be kept. If you want to
remove singletons, set it to 2, if you want to remove
doubletons as well, set it to 3, and so on.
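To make the effect of this threshold concrete, here is a minimal sketch. It assumes that "occurrence" means the number of distinct genomes a gene cluster appears in; the data structure and function name are hypothetical and are not anvi'o's internals.

```python
def filter_by_occurrence(cluster_to_genomes, min_occurrence=1):
    """Keep gene clusters found in at least `min_occurrence` genomes."""
    return {
        name: genomes
        for name, genomes in cluster_to_genomes.items()
        if len(set(genomes)) >= min_occurrence
    }


# Hypothetical gene clusters and the genomes they occur in
clusters = {
    "GC_001": ["genome_A", "genome_B", "genome_C"],
    "GC_002": ["genome_A", "genome_B"],  # a doubleton
    "GC_003": ["genome_C"],              # a singleton
}

everything = filter_by_occurrence(clusters, min_occurrence=1)
no_singletons = filter_by_occurrence(clusters, min_occurrence=2)
no_doubletons = filter_by_occurrence(clusters, min_occurrence=3)
```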
--min-percent-identity PERCENT
Minimum percent identity between the two amino acid
sequences for them to have an edge for MCL analysis.
This value will be used to filter hits from Diamond
search results. Because percent identity is not a
predictor of a good match (since it does not
communicate many other important factors such as the
alignment length between the two sequences and its
proportion to the entire length of those involved), we
suggest you rely on 'minbit' parameter. But you know
what? Maybe you shouldn't listen to anyone, and
experiment on your own! The default is 0 percent.
--sensitive DIAMOND sensitivity. With this flag you can instruct
DIAMOND to be 'sensitive', rather than 'fast' during
the search. It is likely the search will take
remarkably longer. But, hey, if you are doing it for
your final analysis, maybe it should take longer and
be more accurate. This flag is only relevant if you
are running DIAMOND.
OTHERS: Sweet parameters of convenience.
-n PROJECT_NAME, --project-name PROJECT_NAME
Name of the project. Please choose a short but
descriptive name (so anvi'o can use it whenever she
needs to name an output file, or add a new table in a
database, or name her first born).
--description TEXT_FILE
A plain text file that contains some description about
the project. You can use Markdown syntax. The
description text will be rendered and shown in all
relevant interfaces, including the anvi'o interactive
interface, or anvi'o summary outputs.
-o PAN_DB_DIR, --output-dir PAN_DB_DIR
Directory path for output files
-W, --overwrite-output-destinations
Overwrite if the output files and/or directories
exist.
-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on a SGE
--if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.
ORGANIZING GENE CLUSTERS: These are things that will change the clustering dendrogram of your gene clusters.
--skip-hierarchical-clustering
Anvi'o attempts to generate a hierarchical clustering
of your gene clusters once it identifies them so you
can use `anvi-display-pan` to play with it. But if you
want to skip this step, this is your flag.
--enforce-hierarchical-clustering
If you want anvi'o to try to generate a hierarchical
clustering of your gene clusters even if the number of
gene clusters exceeds its suggested limit for
hierarchical clustering, you can use this flag to
enforce it. Are you a rebel of some sort? Or did
computers make you upset? Express your anger towards
machines using this flag.
--distance DISTANCE_METRIC
The distance metric for the clustering of gene
clusters. If you do not use this flag, the default
distance metric will be used for each clustering
configuration, which is "euclidean".
--linkage LINKAGE_METHOD
The same story as with `--distance`, except the
system default for this one is "ward".
Creates a single anvi'o profile database. The default input to this program is
a BAM file. When it is run on a BAM file, depending on the user parameters,
the program quantifies coverage per nucleotide position (and averages them out
per contig), calculates single-nucleotide, single-codon, and single-amino acid
variants, as well as structural variants such as insertions and deletions, and
stores these data into appropriate tables. Anvi'o single profiles can be
merged by the program anvi-merge.
metagenomics
profile_db
contigs_db
bam
variability
clustering
Example uses and other resources
Usage
anvi-profile [-h] [-i INPUT_BAM] [-c CONTIGS_DB] [--blank-profile]
[-o DIR_PATH] [-W] [-S NAME] [--report-variability-full]
[--skip-SNV-profiling] [--skip-INDEL-profiling]
[--profile-SCVs] [--min-percent-identity PERCENT_IDENTITY]
[--description TEXT_FILE] [--cluster-contigs]
[--skip-hierarchical-clustering]
[--distance DISTANCE_METRIC] [--linkage LINKAGE_METHOD]
[-M INT] [--max-contig-length INT] [-X INT] [-V INT]
[--list-contigs] [--contigs-of-interest FILE]
[-T NUM_THREADS] [--queue-size INT]
[--write-buffer-size-per-thread INT] [--force-multi]
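As an illustration of how the flags in this usage line fit together, the sketch below assembles a command line in Python without running it. The flags are taken from the usage above; the file names, sample name, and helper function are hypothetical placeholders.

```python
import shlex


def build_anvi_profile_command(bam, contigs_db, output_dir, sample_name,
                               num_threads=1, min_contig_length=1000):
    """Return an argument list for anvi-profile (nothing is executed)."""
    return [
        "anvi-profile",
        "-i", bam,
        "-c", contigs_db,
        "-o", output_dir,
        "-S", sample_name,
        "-T", str(num_threads),
        "--min-contig-length", str(min_contig_length),
    ]


command = build_anvi_profile_command(
    "SAMPLE_01.bam",      # hypothetical sorted and indexed BAM file
    "CONTIGS.db",         # hypothetical contigs database
    "SAMPLE_01_PROFILE",  # hypothetical output directory
    "SAMPLE_01",
    num_threads=4,
)
command_line = shlex.join(command)
print(command_line)
```

Keeping the command as a list means it could be handed to `subprocess.run` as is, with no shell quoting concerns.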
Parameters
INPUTS: There are two possible inputs for the anvi'o profiler. You must declare one of these two.
-i INPUT_BAM, --input-file INPUT_BAM
Sorted and indexed BAM file to analyze. Takes a long
time depending on the length of the file and
parameters used for profiling.
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by
'anvi-gen-contigs-database'
--blank-profile If you only have contig sequences, but no mapping data
(e.g., you found a genome and would like to take a
look at it), this flag will come in very handy. After
creating a contigs database for your contigs, you can
create a blank anvi'o profile database to use the anvi'o
interactive interface with that contigs database
without any mapping data.
EXTRAS: Things that are not mandatory, but can be useful if/when declared.
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files
-W, --overwrite-output-destinations
Overwrite if the output files and/or directories
exist.
-S NAME, --sample-name NAME
It is important to set a sample name (using only ASCII
letters and digits and without spaces) that is unique
(considering all others). If you do not provide one,
anvi'o will try to make up one for you based on other
information, although you should never let the
software decide these things).
--report-variability-full
One of the things anvi-profile does is to store
information about variable nucleotide positions.
Usually it does not report every variable position,
since not every variable position is genuine
variation. Say, if you have 1,000 coverage, and all
nucleotides at that position are Ts and only one of
them is a C, the confidence of that C being a real
variation is quite low. anvi'o has a simple algorithm
in place to reduce the impact of noise. However, using
this flag you can disable it and ask profiler to
report every single variation (which may result in
very large output files and millions of reports, but
you are the boss). Do not forget to take a look at the
'--min-coverage-for-variability' parameter.
--skip-SNV-profiling By default, anvi'o characterizes single-nucleotide
variation in each sample. The use of this flag will
instruct profiler to skip that step. Please remember
that parameters and flags must be identical between
different profiles using the same contigs database for
them to merge properly.
--skip-INDEL-profiling
The alignment of a read to a reference genome/sequence
can be imperfect, such that the read exhibits
insertions or deletions relative to the reference.
Anvi'o normally stores this information in the profile
database since the time taken and extra storage do not
amount to much, but if you insist on not having this
information, you can skip storing this information by
providing this flag. Note: If --skip-SNV-profiling is
provided, --skip-INDEL-profiling will automatically be
enforced.
--profile-SCVs Anvi'o can perform accurate characterization of codon
frequencies in genes during profiling. While having
codon frequencies opens doors to powerful evolutionary
insights in downstream analyses, due to its
computational complexity, this feature comes 'off' by
default. Using this flag you can rise against the
authority, as you always should, and make anvi'o
profile codons.
--min-percent-identity PERCENT_IDENTITY
Ignore any reads with a percent identity to the
reference less than this number, e.g. 95. If not
provided, all reads in the BAM file will be used (and
things will run faster).
--description TEXT_FILE
A plain text file that contains some description about
the project. You can use Markdown syntax. The
description text will be rendered and shown in all
relevant interfaces, including the anvi'o interactive
interface, or anvi'o summary outputs.
HIERARCHICAL CLUSTERING: Do you want your splits to be clustered? Yes? No? Maybe? Remember: By default, anvi-profile will not perform hierarchical clustering on your splits; but if you use the --blank-profile flag, it will try. You can skip that by using the --skip-hierarchical-clustering flag.
--cluster-contigs Single profiles are rarely used for genome binning or
visualization, and since the clustering step increases the
profiling runtime for no good reason, the default
behavior is to not cluster contigs for individual
runs. However, if you are planning to do binning on
one sample, you must use this flag to tell anvi'o to
run cluster configurations for single runs on your
sample.
--skip-hierarchical-clustering
If you are not planning to use the interactive
interface (or if you have other means to add a tree of
contigs in the database) you may skip the step where
hierarchical clustering of your items is performed
based on default clustering recipes matching to your
database type.
--distance DISTANCE_METRIC
The distance metric for the hierarchical clustering.
Only relevant if you are using `--cluster-contigs`
flag. The default is "euclidean".
--linkage LINKAGE_METHOD
The linkage method for the hierarchical clustering.
Just like the distance metric this is only relevant if
you are using it with `--cluster-contigs` flag. The
default is "ward".
NUMBERS: Defaults of these parameters will impact your analysis. You can always come back to them and update your profiles, but it is important to make sure defaults are reasonable for your sample.
-M INT, --min-contig-length INT
Minimum length of contigs in a BAM file to analyze.
The minimum length should be long enough for tetra-
nucleotide frequency analysis to be meaningful. There
is no way to define a golden number of minimum length
that would be applicable to genomes found in all
environments, but we chose the default to be 1000, and
have been happy with it. You are welcome to
experiment, but we advise to never go below 1,000. You
also should remember that the lower you go, the more
time it will take to analyze all contigs. You can use
--list-contigs parameter to have an idea how many
contigs would be discarded for a given M.
--max-contig-length INT
Just like the minimum contig length parameter, but to
set a maximum. Basically this will remove any contig
longer than a certain value. Why would anyone need
this? Who knows. But if you ever do, it is here.
-X INT, --min-mean-coverage INT
Minimum mean coverage for contigs to be kept in the
analysis. The default value is 0, which is in your
best interest if you are going to profile multiple BAM
files which are then going to be merged for a cross-
sectional or time series analysis. Do not change it if
you are not sure this is what you want to do.
-V INT, --min-coverage-for-variability INT
Minimum coverage of a nucleotide position to be
subjected to SNV profiling. By default, anvi'o will
not attempt to make sense of variation in a given
nucleotide position if it is covered less than 10X.
You can change that minimum using this parameter.
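A minimal sketch of what this coverage cutoff means, assuming a plain list of per-position coverage values and that positions at exactly the minimum are still considered. The function name is hypothetical and this is not how anvi'o stores coverage internally.

```python
def positions_eligible_for_snv_profiling(coverage, min_coverage=10):
    """Return 0-based positions whose coverage is at least `min_coverage`."""
    return [pos for pos, cov in enumerate(coverage) if cov >= min_coverage]


# Hypothetical coverage values for the first six positions of a contig
coverage = [3, 10, 25, 9, 0, 112]

eligible_default = positions_eligible_for_snv_profiling(coverage)
eligible_strict = positions_eligible_for_snv_profiling(coverage, min_coverage=20)
```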
CONTIGS: Sweet parameters of convenience
--list-contigs When declared, the program will list contigs in the
BAM file and exit gracefully without any further
analysis.
--contigs-of-interest FILE
It is possible to focus on only a set of contigs. If
you would like to do that and ignore the rest of the
contigs in your contigs database, use this parameter
with a flat file, each line of which describes a single
contig name.
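The flat file described above is simple enough to generate programmatically. The sketch below writes one contig name per line; the contig names, file name, and helper are hypothetical, and dropping blanks and duplicates is a convenience added here rather than a documented requirement.

```python
import tempfile
from pathlib import Path


def write_contigs_of_interest(contig_names, path):
    """Write one contig name per line, skipping blanks and duplicates."""
    unique = dict.fromkeys(name.strip() for name in contig_names if name.strip())
    Path(path).write_text("".join(f"{name}\n" for name in unique))
    return list(unique)


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "contigs-of-interest.txt"
    written = write_contigs_of_interest(
        ["contig_0001", "contig_0042", "contig_0001", ""], target
    )
    content = target.read_text()
```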
PERFORMANCE: Performance settings for profiler
-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on a SGE
--if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.
--queue-size INT The queue size for worker threads to store data to
communicate to the main thread. The default is set by
the class based on the number of threads. If you have
*any* hesitation about whether you know what you are
doing, you should not change this value.
--write-buffer-size-per-thread INT
How many items should be kept in memory before they
are written to the disk. The default is 500 per
thread. So a single-threaded job would have a write
buffer size of 500, whereas a job with 4 threads would
have a write buffer size of 4*500. The larger the
buffer size, the less frequently the program will access
the disk, yet the more memory will be consumed
since the processed items will be cleared off the
memory only after they are written to the disk. The
default buffer size will likely work for most cases.
Please keep an eye on the memory usage output to make
sure the memory use never exceeds the size of the
physical memory.
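The arithmetic in this description is a straight multiplication, spelled out below as a sketch. The function name is hypothetical; only the default of 500 items per thread comes from the help text.

```python
def total_write_buffer_size(num_threads, per_thread=500):
    """Total number of items held in memory before a write to disk."""
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    return num_threads * per_thread


single_threaded = total_write_buffer_size(1)
four_threads = total_write_buffer_size(4)
```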
--force-multi This is not useful to non-developers. It forces the
multi-process routine even when 1 thread is chosen.
Push stuff to an anvi'server
Usage
anvi-push [-h] --user USERNAME [--api-url API_URL] -n PROJECT_NAME
[-t NEWICK] [--items-order FLAT_FILE] [-f FASTA]
[-d VIEW_DATA] [-A ADDITIONAL_LAYERS] [-s STATE]
[--description TEXT_FILE] [--bins BINS_DATA]
[--bins-info BINS_INFO] [--delete-if-exists]
Parameters
SERVER DETAILS: Details of how to access an anvi'server instance.
--user USERNAME The user for an anvi'server.
--api-url API_URL Anvi'server url
PROJECT DETAILS: What to send to the server
-n PROJECT_NAME, --project-name PROJECT_NAME
Name of the project. Please choose a short but
descriptive name (so anvi'o can use it whenever she
needs to name an output file, or add a new table in a
database, or name her first born).
-t NEWICK, --tree NEWICK
NEWICK formatted tree structure
--items-order FLAT_FILE
A flat file that contains the order of items you wish
to display using the interactive interface. You may
want to use this if you have a specific order of items
in your mind, and do not want to display a tree in the
middle (or simply you don't have one). The file format
is simple: each line should have an item name, and
there should be no header.
-f FASTA, --fasta-file FASTA
A FASTA-formatted input file
-d VIEW_DATA, --view-data VIEW_DATA
A TAB-delimited file for view data
-A ADDITIONAL_LAYERS, --additional-layers ADDITIONAL_LAYERS
A TAB-delimited file for additional layers for splits.
The first column of this file must be split names, and
the remaining columns should be unique attributes. The
file does not need to contain all split names, or
values for each split in every column. Anvi'o will try
to deal with missing data nicely. Each column in this
file will be visualized as a new layer in the tree.
-s STATE, --state STATE
State file. You can export states from a database using
the anvi-export-state program.
--description TEXT_FILE
A plain text file that contains some description about
the project. You can use Markdown syntax. The
description text will be rendered and shown in all
relevant interfaces, including the anvi'o interactive
interface, or anvi'o summary outputs.
--bins BINS_DATA Tab-delimited file, first column contains tree leaves
(gene clusters, splits, contigs etc.) and second
column contains the bin they belong to.
--bins-info BINS_INFO
Additional information for bins. The file must contain
three TAB-delimited columns, where the first one must
be a unique bin name, the second should be a 'source',
and the last one should be a 7 character HTML color
code (e.g., '#424242'). The source column must contain
information about the origin of the bin. If these bins
are automatically identified by a program like
CONCOCT, this column could contain the program name
and version. The source information will be associated
with the bin in various interfaces so in a sense it is
not *that* critical what it says there, but on the
other hand it is, because we should also think about
people who may end up having to work with what we put
together later.
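The three-column layout described for the bins-info file can be checked before pushing. The following is a sketch of such a check, assuming no header line and a '#' followed by six hexadecimal digits for the color. The parser and the example rows are hypothetical, and anvi'o's own validation may differ.

```python
import re

HTML_COLOR = re.compile(r"^#[0-9A-Fa-f]{6}$")


def parse_bins_info(lines):
    """Parse 'bin name <TAB> source <TAB> HTML color' lines into a dict."""
    bins = {}
    for line_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {line_number}: expected 3 TAB-delimited columns")
        name, source, color = fields
        if name in bins:
            raise ValueError(f"line {line_number}: bin name '{name}' is not unique")
        if not HTML_COLOR.match(color):
            raise ValueError(f"line {line_number}: '{color}' is not a 7 character HTML color")
        bins[name] = {"source": source, "html_color": color}
    return bins


example = [
    "Bin_1\tCONCOCT\t#424242\n",
    "Bin_2\tmanual refinement\t#FF8800\n",
]
parsed = parse_bins_info(example)
```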
RISKY CLICKS: As the name suggests!
--delete-if-exists Be bold (at your own risk), and delete if exists.
Start an anvi'o interactive interface to manually curate or refine a genome, whether it is a metagenome-assembled, single-cell, or isolate genome.
Example uses and other resources
Usage
anvi-refine [-h] -p PROFILE_DB -c CONTIGS_DB [-C COLLECTION_NAME]
[-b BIN_NAME] [-B FILE_PATH]
[--find-from-split-name SPLIT_NAME] [-t NEWICK]
[--skip-hierarchical-clustering] [--load-full-state]
[-V ADDITIONAL_VIEW] [-A ADDITIONAL_LAYERS]
[--split-hmm-layers]
[--taxonomic-level {t_domain,t_phylum,t_class,t_order,t_family,t_genus,t_species}]
[--hide-outlier-SNVs] [--title NAME]
[--export-svg FILE_PATH] [--dry-run]
[--skip-init-functions] [--skip-auto-ordering] [-I IP_ADDR]
[-P INT] [--browser-path PATH] [--read-only]
[--server-only] [--password-protected]
Parameters
DEFAULT INPUTS: The interactive interface can be started with and without anvi'o databases. The default use assumes you have your profile and contigs databases; however, it is also possible to start the interface using ad-hoc input files. See the 'MANUAL INPUT' section for another set of parameters that are mutually exclusive with databases.
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by
'anvi-gen-contigs-database'
REFINE-SPECIFICS: Parameters that are essential to the refinement process.
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.
-B FILE_PATH, --bin-ids-file FILE_PATH
Text file for bins (each line should be a unique bin
id).
--find-from-split-name SPLIT_NAME
If you don't know the bin name you want to work with
but you know the name of a split it contains, you can
use this parameter to tell anvi'o the split name so
it can find the bin for you automatically. This is
something extremely difficult for anvi'o to do, but it
does it anyway because you.
ADDITIONAL STUFF: Parameters to provide additional layers, views, or layer data.
-t NEWICK, --tree NEWICK
NEWICK formatted tree structure
--skip-hierarchical-clustering
Skip hierarchical clustering for the splits in the
refined bin. If you skip clustering, you need to
provide your own NEWICK formatted tree using the --tree
parameter.
--load-full-state Often the minimum and maximum values defined for
an entire profile database that contains all contigs
do not scale well when you wish to work with a single
bin in the refine mode. For this reason, the default
behavior of anvi-refine is to ignore min/max values
set in the default state. This flag is your way of
telling anvi'o to not do that, and load the state
stored in the profile database as is.
-V ADDITIONAL_VIEW, --additional-view ADDITIONAL_VIEW
A TAB-delimited file for an additional view to be used
in the interface. This file should contain all split
names, and values for each of them in all samples.
Each column in this file must correspond to a sample
name. Content of this file will be called 'user_view',
which will be available as a new item in the 'views'
combo box in the interface
-A ADDITIONAL_LAYERS, --additional-layers ADDITIONAL_LAYERS
A TAB-delimited file for additional layers for splits.
The first column of this file must be split names, and
the remaining columns should be unique attributes. The
file does not need to contain all split names, or
values for each split in every column. Anvi'o will try
to deal with missing data nicely. Each column in this
file will be visualized as a new layer in the tree.
VISUALS RELATED: Parameters that give access to various adjustments regarding the interface.
--split-hmm-layers When declared, this flag tells the interface to split
every gene found in HMM searches that were performed
against non-singlecopy gene HMM profiles into their
own layer. Please see the documentation for details.
--taxonomic-level {t_domain,t_phylum,t_class,t_order,t_family,t_genus,t_species}
The taxonomic level to use whenever relevant and/or
available. The default taxonomic level is t_genus, but
if you choose something specific, anvi'o will focus on
that whenever possible.
--hide-outlier-SNVs During profiling, anvi'o marks positions of single-
nucleotide variations (SNVs) that originate from
places in contigs where coverage values are a bit
'sketchy'. If you would like to avoid SNVs in those
positions of splits in applicable projects you can use
this flag, and the interface would hide SNVs that are
marked as 'outlier' (although it is clearly the best
to see everything, no one will judge you if you end up
using this flag) (plus, there may or may not be some
historical data on this here:
https://github.com/meren/anvio/issues/309).
--title NAME Title for the interface. If you are working with a
RUNINFO dict, the title will be determined based on
information stored in that file. Regardless, you can
override that value using this parameter.
--export-svg FILE_PATH
The SVG output file path.
SWEET PARAMS OF CONVENIENCE: Parameters and flags that are not quite essential (but nice to have).
--dry-run Don't do anything real. Test everything, and stop
right before wherever the developer said 'well, this
is enough testing', and decided to print out results.
--skip-init-functions
When declared, function calls for genes will not be
initialized (therefore will be missing from all
relevant interfaces or output files). The use of this
flag may reduce the memory footprint and processing
time for large datasets.
--skip-auto-ordering When declared, the attempt to include automatically
generated orders of items based on additional data is
skipped. In case those buggers cause issues with your
data, and you still want to see your stuff and deal
with the other issue maybe later.
SERVER CONFIGURATION: For power users.
-I IP_ADDR, --ip-address IP_ADDR
IP address for the HTTP server. The default ip address
(0.0.0.0) should work just fine for most.
-P INT, --port-number INT
Port number to use for anvi'o services. If nothing is
declared, anvi'o will try to find a suitable port
number, starting from the default port number, 8080.
--browser-path PATH By default, anvi'o will use your default browser to
launch the interactive interface. If you would like to
use something other than your system default, you can
provide a full path for an alternative browser using
this parameter, and hope for the best. For instance we
are using this parameter to call Google's experimental
browser, Canary, which performs better with demanding
visualizations.
--read-only When the interactive interface is started with this
flag, all 'database write' operations will be
disabled.
--server-only The default behavior is to start the local server, and
fire up a browser that connects to the server. If you
have other plans, and want to start the server without
calling the browser, this is the flag you need.
--password-protected If this flag is set, the command line tool will ask you to
enter a password, and the interactive interface will
only be accessible after entering the same password. This
option is recommended for shared machines like
clusters or shared networks where computers are not
isolated.
Rename all bins in a given collection (so they have pretty names).
Usage
anvi-rename-bins [-h] -c CONTIGS_DB -p PROFILE_DB
[--collection-to-read COLLECTION_TO_READ]
[--collection-to-write COLLECTION_TO_WRITE]
[--prefix PREFIX] [--report-file REPORT_FILE_PATH]
[--list-collections] [--dry-run] [--call-MAGs]
[--min-completion-for-MAG [0-100]]
[--max-redundancy-for-MAG [0-100]]
[--size-for-MAG MEGABASEPAIRS]
Parameters
DEFAULT INPUTS: Standard stuff
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by
'anvi-gen-contigs-database'
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
--collection-to-read COLLECTION_TO_READ
Collection name to read from. Anvi'o will not
overwrite an existing collection, instead, it will
create a copy of your collection with new bin names.
--collection-to-write COLLECTION_TO_WRITE
The new collection name. Give it a nice, fancy name.
OUTPUT AND TESTING: a.k.a, sweet parameters of convenience
--prefix PREFIX Prefix for the bin names. Must be a single word,
composed of letters and digits. The use of the
underscore character is OK, but that's about it (fine,
the use of the dash character is OK, too but no
more!). If the prefix is 'PREFIX', each bin will be
renamed as 'PREFIX_XXX_00001, PREFIX_XXX_00002', and
so on, in the order of percent completion minus
percent redundancy (what we call, 'substantive
completion'). The 'XXX' part will either be 'Bin' or
'MAG', depending on other parameters you use. Keep
reading.
--report-file REPORT_FILE_PATH
This file will report each name change event, so you
can trace back the original names of renamed bins
later.
--list-collections Show available collections and exit.
--dry-run When used does NOT update the profile database, just
creates the report file so you can view how things
will be renamed.
MAG OPTIONS: If you want to call some bins 'MAGs' because you are so cool
--call-MAGs By default, this program renames your bins as
'PREFIX_Bin_00001', 'PREFIX_Bin_00002' and so on. If
you use this flag, it will name the ones that meet the
criteria described by MAG-related flags as
'PREFIX_MAG_00001', 'PREFIX_MAG_00002', and so on. The
ones that do not get to be named as MAGs will remain
as bins.
--min-completion-for-MAG [0-100]
If --call-MAGs flag is used, call any bin a 'MAG' if
its completion estimate is above this (the default
is 70), and the redundancy estimate is less than
--max-redundancy-for-MAG.
--max-redundancy-for-MAG [0-100]
If --call-MAGs flag is used, call any bin a 'MAG' if
its redundancy estimate is below this (the default
is 10) and the completion estimate is above
--min-completion-for-MAG.
--size-for-MAG MEGABASEPAIRS
If --call-MAGs flag is used, call any bin a 'MAG' if
its redundancy estimate is less than
--max-redundancy-for-MAG, AND ITS SIZE IS LARGER THAN THIS
VALUE REGARDLESS OF THE COMPLETION ESTIMATE. The
default behavior is to not care about this at all.
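The renaming rules described in the parameters above can be summarized in a short sketch. It follows only what the help text states: bins are ordered by completion minus redundancy, and a bin is called a MAG when redundancy is below the maximum and either completion is above the minimum or, if a size is given, the bin is larger than that size. Everything else is an assumption made for illustration: the dictionary keys, the strict comparisons, and the separate counters for 'MAG' and 'Bin' names.

```python
def rename_bins(bins, prefix, call_mags=False, min_completion=70.0,
                max_redundancy=10.0, size_for_mag=None):
    """Return {old name: new name}, ordered by 'substantive completion'."""
    ordered = sorted(bins, key=lambda b: b["completion"] - b["redundancy"],
                     reverse=True)
    counters = {"MAG": 0, "Bin": 0}  # assumption: MAGs and bins are numbered separately
    new_names = {}
    for b in ordered:
        is_mag = False
        if call_mags:
            low_redundancy = b["redundancy"] < max_redundancy
            complete_enough = b["completion"] > min_completion
            large_enough = size_for_mag is not None and b["size_mbp"] > size_for_mag
            is_mag = low_redundancy and (complete_enough or large_enough)
        label = "MAG" if is_mag else "Bin"
        counters[label] += 1
        new_names[b["name"]] = f"{prefix}_{label}_{counters[label]:05d}"
    return new_names


# Hypothetical bins: completion and redundancy in percent, size in megabase pairs
example_bins = [
    {"name": "bin_a", "completion": 95.0, "redundancy": 2.0, "size_mbp": 3.1},
    {"name": "bin_b", "completion": 40.0, "redundancy": 1.0, "size_mbp": 5.0},
    {"name": "bin_c", "completion": 80.0, "redundancy": 30.0, "size_mbp": 2.0},
]

plain = rename_bins(example_bins, "PREFIX")
with_mags = rename_bins(example_bins, "PREFIX", call_mags=True)
with_size = rename_bins(example_bins, "PREFIX", call_mags=True, size_for_mag=4.0)
```

In this example `bin_b` only becomes a MAG once a size cutoff is given, because its completion estimate alone is too low.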
Access reads in contigs and positions in a BAM file
Usage
anvi-report-linkmers [-h] -i INPUT_BAM(S) [INPUT_BAM(S) ...]
--contigs-and-positions CONTIGS_AND_POS
[--only-complete-links] -o FILE_PATH
[--list-contigs]
Parameters
optional arguments:
-i INPUT_BAM(S) [INPUT_BAM(S) ...], --input-files INPUT_BAM(S) [INPUT_BAM(S) ...]
Sorted and indexed BAM files to analyze. It is
essential that all BAM files are the result of
mappings against the same contigs.
--contigs-and-positions CONTIGS_AND_POS
This is the file where you list the contigs, and
nucleotide positions you are interested in. This is
supposed to be a TAB-delimited file with two columns.
In each line, the first column should be the contig
name, and the second column should be the comma-
separated list of integers for nucleotide positions.
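The two-column format described here can be read with a few lines of Python. This is a sketch of a reader for such a file, not the parser anvi'o uses; the function name and the example contig names are hypothetical.

```python
def parse_contigs_and_positions(lines):
    """Parse 'contig name <TAB> comma-separated positions' lines."""
    contigs = {}
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # tolerate empty lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {line_number}: expected 2 TAB-delimited columns")
        contig_name, positions_field = fields
        try:
            positions = [int(p) for p in positions_field.split(",")]
        except ValueError:
            raise ValueError(f"line {line_number}: positions must be integers")
        contigs[contig_name] = positions
    return contigs


example_lines = [
    "contig_1\t10,25,300\n",
    "contig_2\t7\n",
]
parsed_positions = parse_contigs_and_positions(example_lines)
```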
--only-complete-links
When declared, only reads that cover all positions
will be reported. It is necessary to use this flag if
you want to perform oligotyping-like analyses on
matching reads.
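The idea of a "complete link" can be sketched as a filter over read alignments. The example assumes each read is summarized by a single start and end on the contig (0-based, end exclusive) and ignores gaps within the alignment. These simplifications, the read names, and the function name are all hypothetical.

```python
def reads_covering_all_positions(read_spans, positions):
    """Return ids of reads whose span contains every requested position."""
    return [
        read_id
        for read_id, (start, end) in read_spans.items()
        if all(start <= position < end for position in positions)
    ]


# Hypothetical reads mapped to one contig: read id -> (start, end)
read_spans = {
    "read_1": (0, 150),
    "read_2": (20, 170),
    "read_3": (100, 250),
}

complete_links = reads_covering_all_positions(read_spans, [25, 120])
```

Here `read_3` is left out because it starts after position 25, so it cannot report a nucleotide for every position of interest.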
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
--list-contigs When declared, the program will list contigs in the
BAM file and exit gracefully without any further
analysis.
This program deals with populating tables that store HMM hits in an anvi'o contigs database.
Usage
anvi-run-hmms [-h] -c CONTIGS_DB [-H HMM PROFILE PATH]
[-I HMM PROFILE NAME] [--also-scan-trnas]
[-T NUM_THREADS] [--hmmer-program HMMER_PROGRAM]
[--just-do-it]
Parameters
DB: An anvi'o contigs database to populate with HMM hits
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by
'anvi-gen-contigs-database'
HMM OPTIONS: If you have your own HMMs, or if you would like to run only a set of default anvi'o HMM profiles rather than running them all, this is your stop.
-H HMM PROFILE PATH, --hmm-profile-dir HMM PROFILE PATH
You can use this parameter to specify a directory
path that contains an HMM profile. This way you can run
HMM profiles that are not included in anvi'o. See the
online documentation to find out about the specifics of
this directory structure.
-I HMM PROFILE NAME, --installed-hmm-profile HMM PROFILE NAME
When you run this program without any parameter, it
runs all 4 HMM profiles installed on your system. If
you want only a specific one to run, you can select it
by using this parameter. These are the currently
available ones: "Bacteria_71" (type: singlecopy),
"Archaea_76" (type: singlecopy), "Protista_83" (type:
singlecopy), "Ribosomal_RNAs" (type: Ribosomal_RNAs).
tRNAs: Through this program you can also scan Transfer RNA sequences in your contigs database for free (instead of running anvi-scan-trnas later).
--also-scan-trnas Also scan tRNAs while you're at it.
PERFORMANCE: Stuff everyone forgets to set and then gets upset with how slow science goes.
-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on a SGE
--if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.
--hmmer-program HMMER_PROGRAM
Which of the HMMER programs to use to run HMMs
(hmmscan or hmmsearch). By default anvi'o will use
hmmscan for typical HMM operations like those in
anvi-run-hmms (as these tend to scan a very large number
of genes against a relatively small number of HMMs), but
if you are using this program to scan a very large
number of HMMs, hmmsearch might be a better choice for
performance. For this reason, hmmsearch is the default
in operations like anvi-run-pfams and
anvi-run-kegg-kofams. See this article for a discussion
on the performance of these two programs:
https://cryptogenomicon.org/2011/05/27/hmmscan-vs-hmmsearch-speed-the-numerology/
AUTHORITY: Because you are the boss.
--just-do-it Don't bother me with questions or warnings, just do
it.
Run KOfam HMMs on an anvi'o contigs database
Usage
anvi-run-kegg-kofams [-h] -c CONTIGS_DB [--kegg-data-dir KEGG_DATA_DIR]
[-T NUM_THREADS] [--hmmer-program HMMER_PROGRAM]
[--keep-all-hits] [--just-do-it]
Parameters
REQUIRED INPUT: The stuff you need for this to work.
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by
'anvi-gen-contigs-database'
OPTIONAL INPUT: Optional params for a custom experience.
--kegg-data-dir KEGG_DATA_DIR
The directory path for your KEGG setup, which will
include things like KOfam profiles and KEGG MODULE
data. Anvi'o will try to use the default path if you
do not specify anything.
-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on a SGE
--if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.
--hmmer-program HMMER_PROGRAM
Which of the HMMER programs to use to run HMMs
(hmmscan or hmmsearch). By default anvi'o will use
hmmscan for typical HMM operations like those in anvi-
run-hmms (as these tend to scan a very large number of
genes against a relatively small number of HMMs), but
if you are using this program to scan a very large
number of HMMs, hmmsearch might be a better choice for
performance. For this reason, hmmsearch is the default
in operations like anvi-run-pfams and anvi-run-kegg-
kofams. See this article for a discussion on the
performance of these two programs:
https://cryptogenomicon.org/2011/05/27/hmmscan-vs-hmmsearch-speed-the-numerology/
--keep-all-hits If you use this flag, anvi'o will not get rid of any
raw HMM hits, even those that are below the score
threshold.
--just-do-it Don't bother me with questions or warnings, just do
it.
Run NCBI's COGs to associate genes in an anvi'o contigs database with functions. The COGs database was designed as an attempt to classify proteins from completely sequenced genomes on the basis of the orthology concept. It is no longer actively developed; however, it is still very effective for daily needs. You may want to consider Pfams or the eggNOG database for more comprehensive functional insights.
Usage
anvi-run-ncbi-cogs [-h] -c CONTIGS_DB [--cog-data-dir COG_DATA_DIR]
[-T NUM_THREADS] [--sensitive]
[--temporary-dir-path PATH]
[--search-with SEARCH_METHOD]
Parameters
optional arguments:
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
--cog-data-dir COG_DATA_DIR
The directory path for your COG setup. Anvi'o will try
to use the default path if you do not specify
anything.
-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on a SGE
--if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.
--sensitive DIAMOND sensitivity. With this flag you can instruct
DIAMOND to be 'sensitive', rather than 'fast' during
the search. It is likely the search will take
remarkably longer. But, hey, if you are doing it for
your final analysis, maybe it should take longer and
be more accurate. This flag is only relevant if you
are running DIAMOND.
--temporary-dir-path PATH
If you don't provide anything here, this program will
come up with a temporary directory path by itself to
store intermediate files, and clean it later. If you
want to have full control over this, you can use this
flag to define one.
--search-with SEARCH_METHOD
What program to use for database searching. The
default search uses diamond. All available options
include: diamond, blastp.
Run Pfam on Contigs Database.
Usage
anvi-run-pfams [-h] -c CONTIGS_DB [--pfam-data-dir PFAM_DATA_DIR]
[-T NUM_THREADS] [--hmmer-program HMMER_PROGRAM]
Parameters
optional arguments:
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
--pfam-data-dir PFAM_DATA_DIR
The directory path for your Pfam setup. Anvi'o will
try to use the default path if you do not specify
anything.
-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on a SGE
--if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.
--hmmer-program HMMER_PROGRAM
Which of the HMMER programs to use to run HMMs
(hmmscan or hmmsearch). By default anvi'o will use
hmmscan for typical HMM operations like those in anvi-
run-hmms (as these tend to scan a very large number of
genes against a relatively small number of HMMs), but
if you are using this program to scan a very large
number of HMMs, hmmsearch might be a better choice for
performance. For this reason, hmmsearch is the default
in operations like anvi-run-pfams and anvi-run-kegg-
kofams. See this article for a discussion on the
performance of these two programs:
https://cryptogenomicon.org/2011/05/27/hmmscan-vs-hmmsearch-speed-the-numerology/
The purpose of this program is to affiliate single-copy core genes in an
anvi'o contigs database with taxonomic names. A properly set up local SCG
taxonomy database is required for this program to perform properly. After its
successful run, anvi-estimate-scg-taxonomy will be useful to estimate
taxonomy at the genome, collection, or metagenome level.
Usage
anvi-run-scg-taxonomy [-h] -c CONTIGS_DB
[--scgs-taxonomy-data-dir PATH]
[--min-percent-identity PERCENT_IDENTITY]
[-P NUM_PROCESSES] [-T NUM_THREADS]
[--write-buffer-size INT]
Parameters
INPUT DATABASE: An anvi'o contigs database to search for and store the taxonomic affiliations of SCGs.
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
ADVANCED STUFF:
--scgs-taxonomy-data-dir PATH
The directory for SCGs data to be stored (or read
from, depending on the context). If you leave it as is
without specifying anything, anvi'o will set up
everything in (or try to read things from) a pre-
defined default directory. The advantage of using the
default directory at the time of set up is that every
user of anvi'o on a computer system will be using a
single data directory, but then you may need to run
the setup program with superuser privileges. If you
don't have superuser privileges, then you can use this
parameter to tell anvi'o the location you wish to use
to setup your databases. If you are using a program
(such as `anvi-run-scg-taxonomy` or `anvi-estimate-
scg-taxonomy`) you will have to use this parameter to
tell those programs where your data are.
--min-percent-identity PERCENT_IDENTITY
The default value for this is 90.0%, and in an ideal
world you shouldn't really change it. Lowering this
value will probably give you too many hits from
neighboring genomes, which may ruin your consensus
taxonomy (imagine, at 90% identity you may match to a
single species, but at 70% identity you may match to
every species in a genus and your consensus assignment
may be influenced by that). But once in a while you
will have a genome that doesn't have any close match
in GTDB, and you will be curious to find out what it
could be. So, when you are getting no SCG hits
whatsoever, only then you may want to play with this
value. In those cases you can run anvi-estimate-scg-
taxonomy with a `--debug` flag to see what is really
going on. We strongly advise you to do this only with
single genomes, and never with metagenomes.
PERFORMANCE:
-P NUM_PROCESSES, --num-parallel-processes NUM_PROCESSES
Maximum number of processes to run in parallel. Please
note that this is different from the number of threads. If
you ask for 4 parallel processes, and 5 threads,
anvi'o will run four processes in parallel and assign
5 threads to each. For resource allocation you must
multiply the number of processes and threads.
-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on a SGE
--if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.
--write-buffer-size INT
How many items should be kept in memory before they
are written to the disk. The default is 500. The
larger the buffer size, the less frequently the program
will access the disk, yet the more memory will be
consumed since the processed items will be cleared off
the memory only after they are written to the disk.
The default buffer size will likely work for most
cases, but if you feel you need to reduce it, we trust
you. Please keep an eye on the memory usage output to
make sure the memory use never exceeds the size of the
physical memory.
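As a rough illustration of the resource arithmetic described for -P and -T above, the sketch below multiplies processes by threads to get the number of cores a run would ask for. The function names and the comparison against the machine's CPU count are illustrative assumptions and are not part of anvi'o.

```python
import os


def total_cores_requested(num_processes, num_threads):
    """Return the cores needed when every process gets its own threads."""
    if num_processes < 1 or num_threads < 1:
        raise ValueError("processes and threads must both be at least 1")
    return num_processes * num_threads


def fits_on_machine(num_processes, num_threads, available=None):
    """Hypothetical helper: compare the request with the available CPUs."""
    if available is None:
        # os.cpu_count() may return None, so fall back to a single core.
        available = os.cpu_count() or 1
    return total_cores_requested(num_processes, num_threads) <= available


if __name__ == "__main__":
    # The example from the help text: 4 parallel processes, 5 threads each.
    print(total_cores_requested(4, 5))
```

With -P 4 and -T 5 the request is 20 cores, which is the number to reserve when submitting the job to a cluster.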
Execute, manage, parallelize, and troubleshoot entire 'omics workflows which chain together anvi'o and 3rd party programs.
metagenomics
phylogenomics
contigs
pangenomics
Usage
anvi-run-workflow [-h] [-w WORKFLOW]
[--get-default-config OUTPUT_FILENAME]
[--list-workflows] [--list-dependencies]
[-c CONFIG_FILE] [--dry-run] [--skip-dry-run]
[--save-workflow-graph] [-A ...]
Parameters
ESSENTIAL INPUTS: Things you must provide or this won't work
-w WORKFLOW, --workflow WORKFLOW
You must specify a workflow name. To see a list of
available workflows run --list-workflows.
ADDITIONAL STUFF: additional stuff
--get-default-config OUTPUT_FILENAME
Store a json formatted config file with all the
default settings of the workflow. This is a good draft
you could use in order to write your own config file.
This config file contains all parameters that could be
configured for this workflow. NOTICE: the config file
is provided with default values only for parameters
that are set by us in the workflow. The values for the
rest of the parameters are determined by the relevant
program.
--list-workflows Print a list of available snakemake workflows
--list-dependencies Print a list of the dependencies of this workflow. You
must provide a workflow name and a config file.
snakemake will figure out which rules need to be run
according to your config file, and according to the
files available on your disk. According to the rules
that need to be run, we will let you know which
programs are going to be used, so that you can make
sure you have all of them installed and loaded.
-c CONFIG_FILE, --config-file CONFIG_FILE
A JSON-formatted configuration file.
--dry-run Don't do anything real. Test everything, and stop
right before wherever the developer said 'well, this
is enough testing', and decided to print out results.
--skip-dry-run Don't do a dry run. Just start the workflow! Useful
when your job is so big it takes hours to do a dry
run.
--save-workflow-graph
Save a graph representation of the workflow. If you
are using this flag and if your system is unable to
generate such graph outputs, you will hear anvi'o
complaining (still, totally worth trying).
-A ..., --additional-params ...
Additional snakemake parameters to add when running
snakemake. NOTICE: --additional-params HAS TO BE THE
LAST ARGUMENT THAT IS PASSED TO anvi-run-workflow,
ANYTHING THAT FOLLOWS WILL BE CONSIDERED AS PART OF
THE ADDITIONAL PARAMETERS THAT ARE PASSED TO
SNAKEMAKE. Any parameter that is accepted by snakemake
should be fair game here, but it is your
responsibility to make sure that whatever you added
makes sense. To see what parameters are available
please refer to the snakemake documentation. For
example, you could use this to set up cluster
submission using --additional-params --cluster 'YOUR-
CLUSTER-SUBMISSION-CMD'.
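The ordering rule for --additional-params can be sketched as a small command builder that always appends it last. This is a minimal sketch: the helper name, the workflow name 'metagenomics', and the file name 'config.json' are placeholders chosen for illustration, and the command is only printed, not executed.

```python
import shlex


def build_workflow_command(workflow, config_file, additional_params=None):
    """Assemble an anvi-run-workflow argument list with -A placed last."""
    cmd = ["anvi-run-workflow", "-w", workflow, "-c", config_file]
    if additional_params:
        # Everything after --additional-params is passed on to snakemake,
        # so no anvi'o argument may follow it.
        cmd += ["--additional-params", *additional_params]
    return cmd


if __name__ == "__main__":
    cmd = build_workflow_command(
        "metagenomics",   # placeholder workflow name
        "config.json",    # placeholder config file
        additional_params=["--cluster", "YOUR-CLUSTER-SUBMISSION-CMD"],
    )
    print(shlex.join(cmd))
```

Building the argument list in one place makes it hard to accidentally add an anvi'o flag after the snakemake parameters.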
Identify and store tRNA genes in a contigs database.
Usage
anvi-scan-trnas [-h] -c CONTIGS_DB [-T NUM_THREADS]
[--log-file FILE_PATH] [--trna-hits-file FILE_PATH]
[--trna-cutoff-score INT] [--just-do-it]
Parameters
optional arguments:
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on a SGE
--if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.
--log-file FILE_PATH File path to store debug/output messages.
--trna-hits-file FILE_PATH
File path to store raw hits from tRNA scan.
--trna-cutoff-score INT
Minimum score to assume a hit comes from a proper tRNA
gene (passed to the tRNAScan-SE). The default is 20.
It can get any value between 0-100.
--just-do-it Don't bother me with questions or warnings, just do
it.
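The effect of --trna-cutoff-score can be pictured as a simple score filter. The sketch below is only an illustration: the (name, score) tuples are made-up data, and treating the cutoff as inclusive is an assumption rather than documented behavior.

```python
def filter_trna_hits(hits, cutoff=20):
    """Keep hits scoring at or above the cutoff (inclusive by assumption)."""
    if not 0 <= cutoff <= 100:
        raise ValueError("cutoff must be between 0 and 100")
    return [(name, score) for name, score in hits if score >= cutoff]


if __name__ == "__main__":
    # Hypothetical raw hits: (hit name, score).
    raw_hits = [("hit_a", 71.3), ("hit_b", 18.9), ("hit_c", 20.0)]
    print(filter_trna_hits(raw_hits))
```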
Search functions in an anvi'o contigs database or genomes storage. Basically,
this program searches for one or more search terms you define in functional
annotations of genes in an anvi'o contigs database, and generates multiple
reports. The simpler report (which is also the default one) simply tells you
which contigs contain genes with functions matching the search terms you used.
This file is only useful to quickly highlight matching contigs in the
interface by providing it to anvi-interactive with the --additional-layer
parameter. You can also request a much more comprehensive report, which
gives you anything you might need to know, including the matching gene caller
id, functional annotation source, and full function name for each hit and
search term.
Usage
anvi-search-functions [-h] [-c CONTIGS_DB] [-p PAN_DB]
[-g GENOMES_STORAGE] --search-terms SEARCH_TERMS
[--delimiter CHAR]
[--annotation-sources SOURCE NAME[S]] [-l]
[-o FILE_PATH] [--full-report FILE_NAME]
[--include-sequences] [--verbose]
Parameters
SEARCH IN: Relevant source databases
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-p PAN_DB, --pan-db PAN_DB
Anvi'o pan database
-g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
Anvi'o genomes storage file
SEARCH FOR: Relevant terms
--search-terms SEARCH_TERMS
Search terms. Multiple of them can be declared
separated by a delimiter (the default is a comma).
--delimiter CHAR The delimiter to parse multiple input terms. The
default is ','.
--annotation-sources SOURCE NAME[S]
Get functional annotations for a specific list of
annotation sources. You can specify one or more
sources by separating them from each other with a
comma character (i.e., '--annotation-sources
source_1,source_2,source_3'). The default behavior is
to return everything
-l, --list-annotation-sources
List available functional annotation sources.
REPORT: Anvi'o can report the hits in multiple ways. The output file will be a
very simple 2-column TAB-delimited output that is compatible with anvi'o
additional data format (so you can give it to anvi-interactive to
see which splits contained genes that were matching to your search terms).
You can also ask anvi'o to generate a full report that contains much more
detailed and helpful information about each hit. Optionally you can even ask
for the gene sequences to appear in this report.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
--full-report FILE_NAME
Optional output file with a fuller description of
findings.
--include-sequences Include sequences in the report.
--verbose Be verbose, print more messages whenever possible.
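How --search-terms and --delimiter interact can be sketched as below. This is a simplified illustration, not anvi'o's implementation: stripping whitespace, case-insensitive substring matching, and the example annotations are all assumptions.

```python
def parse_search_terms(raw, delimiter=","):
    """Split a raw search string into terms, dropping empty entries."""
    terms = [term.strip() for term in raw.split(delimiter)]
    return [term for term in terms if term]


def find_matches(annotations, terms):
    """Map each term to the gene ids whose annotation text contains it."""
    hits = {term: [] for term in terms}
    for gene_id, function in annotations.items():
        for term in terms:
            if term.lower() in function.lower():
                hits[term].append(gene_id)
    return hits


if __name__ == "__main__":
    # Hypothetical gene caller ids and function names.
    annotations = {1: "Histidine kinase", 2: "ABC transporter", 3: "Kinase-like protein"}
    terms = parse_search_terms("kinase, transporter")
    print(find_matches(annotations, terms))
```

A different delimiter is useful when a search term itself contains a comma, for example parse_search_terms("DNA polymerase III, alpha|kinase", delimiter="|").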
A script for anvi'o to test itself
Usage
anvi-self-test [-h] [--suite SUITE]
Parameters
optional arguments:
--suite SUITE Suite of tests to execute. By default this program will
execute a full suite of example anvi'o commands to ensure
your installation is ready to run all scenarios anvi'o
developers could think of. Alternatively you can choose a
specific test to run. Here is a full list of available
options: mini, full, pangenomics, alons-classifier, manual-
interface.
Download and setup KEGG KOfam HMM profiles
Usage
anvi-setup-kegg-kofams [-h] [--kegg-data-dir KEGG_DATA_DIR]
[--kegg-archive KEGG_ARCHIVE] [--reset]
[--just-do-it]
Parameters
POSSIBLE INPUT: Not required for this program to run, but could be useful.
--kegg-data-dir KEGG_DATA_DIR
The directory path for your KEGG setup, which will
include things like KOfam profiles and KEGG MODULE
data. Anvi'o will try to use the default path if you
do not specify anything.
--kegg-archive KEGG_ARCHIVE
The path to an archived KEGG directory. If you provide
this parameter, anvi'o will set up the KEGG data
directory from the archive rather than downloading and
building it from the KEGG website.
--reset Remove all the previously stored files and start over.
If something feels wrong for some reason and if you
believe re-downloading files and setting them up could
address the issue, this is the flag that will tell
anvi'o to act like a real computer scientist
challenged with a computational problem.
--just-do-it Don't bother me with questions or warnings, just do
it.
Download and setup NCBI's Clusters of Orthologous Groups database.
Usage
anvi-setup-ncbi-cogs [-h] [--cog-data-dir COG_DATA_DIR] [--reset]
[--just-do-it] [-T NUM_THREADS]
Parameters
optional arguments:
--cog-data-dir COG_DATA_DIR
The directory for COG data to be stored. If you leave
it as is without specifying anything, the default
destination for the data directory will be used to set
things up. The advantage of it is that everyone will
be using a single data directory, but then you may
need superuser privileges to do it. Using this
parameter you can choose the location of the data
directory somewhere you like. However, when it is time
to run COGs, you will need to remember that path and
provide it to the program.
--reset Remove all the previously stored files and start over.
If something feels wrong for some reason and if you
believe re-downloading files and setting them up could
address the issue, this is the flag that will tell
anvi'o to act like a real computer scientist
challenged with a computational problem.
--just-do-it Don't bother me with questions or warnings, just do
it.
-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on a SGE
--if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.
Setup or update an offline database of representative PDB structures clustered at 95%.
Usage
anvi-setup-pdb-database [-h] [--pdb-database-path PATH]
[-T NUM_THREADS] [--update]
[--skip-modeller-update] [--reset]
Parameters
optional arguments:
--pdb-database-path PATH
The path for the PDB database to be stored. If you
leave it as is without specifying anything, anvi'o
will set up everything in a pre-defined default
directory. The advantage of using the default
directory at the time of set up is that every user of
anvi'o on a computer system will be using a single
data directory, but then you may need to run the setup
program with superuser privileges. If you don't have
superuser privileges, then you can use this parameter
to tell anvi'o the location you wish to use to setup
your database.
-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on a SGE
--if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.
--update Use this flag if you would like to update your current
database.
--skip-modeller-update
By default, MODELLER's search DB is updated when this
program is run so that if MODELLER finds a protein,
its structure is guaranteed to be in this database. If
you don't want to touch the MODELLER database, use
this flag.
--reset Remove all the previously stored files and start over.
If something feels wrong for some reason and if you
believe re-downloading files and setting them up could
address the issue, this is the flag that will tell
anvi'o to act like a real computer scientist
challenged with a computational problem.
Download and setup Pfam data from the EBI.
Usage
anvi-setup-pfams [-h] [--pfam-data-dir PFAM_DATA_DIR] [--reset]
Parameters
optional arguments:
--pfam-data-dir PFAM_DATA_DIR
The directory for Pfam data to be stored. If you leave
it as is without specifying anything, the default
destination for the data directory will be used to set
things up. The advantage of it is that everyone will
be using a single data directory, but then you may
need superuser privileges to do it. Using this
parameter you can choose the location of the data
directory somewhere you like. However, when it is time
to run Pfam, you will need to remember that path and
provide it to the program.
--reset This program by default attempts to use previously
downloaded files in your Pfam data directory if there
are any. If something is wrong for some reason you can
use this to tell anvi'o to remove everything, and
start over.
The purpose of this program is to download necessary information from GTDB
(https://gtdb.ecogenomic.org/), and set it up in such a way that your anvi'o
installation is able to assign taxonomy to single-copy core genes using
anvi-run-scg-taxonomy and estimate taxonomy for genomes or metagenomes using
anvi-estimate-genome-taxonomy.
Usage
anvi-setup-scg-taxonomy [-h] [-T NUM_THREADS]
[--scgs-taxonomy-data-dir PATH]
[--scgs-taxonomy-remote-database-url URL]
[--reset] [--redo-databases]
Parameters
optional arguments:
-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on a SGE
--if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.
--scgs-taxonomy-data-dir PATH
The directory for SCGs data to be stored (or read
from, depending on the context). If you leave it as is
without specifying anything, anvi'o will set up
everything in (or try to read things from) a pre-
defined default directory. The advantage of using the
default directory at the time of set up is that every
user of anvi'o on a computer system will be using a
single data directory, but then you may need to run
the setup program with superuser privileges. If you
don't have superuser privileges, then you can use this
parameter to tell anvi'o the location you wish to use
to setup your databases. If you are using a program
(such as `anvi-run-scg-taxonomy` or `anvi-estimate-
scg-taxonomy`) you will have to use this parameter to
tell those programs where your data are.
--scgs-taxonomy-remote-database-url URL
Anvi'o will always try to download the latest release,
but if there is a problem with the latest release,
feel free to run setup using a different URL. Just to
note, anvi'o will expect to find the following files
in the URL provided here: 'VERSION',
'ar122_msa_individual_genes.tar.gz',
'ar122_taxonomy.tsv',
'bac120_msa_individual_genes.tar.gz', and
'bac120_taxonomy.tsv'. If everything fails, you can
give this URL, which is supposed to work if the server
on which these databases are maintained is still online:
https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/89.0/
--reset Remove all the previously stored files and start over.
If something feels wrong for some reason and if you
believe re-downloading files and setting them up could
address the issue, this is the flag that will tell
anvi'o to act like a real computer scientist
challenged with a computational problem.
--redo-databases Remove existing databases and re-create them. This can
be necessary when versions of programs change and
databases they create and use become incompatible.
A script to display collections stored in an anvi'o profile or pan database.
Usage
anvi-show-collections-and-bins [-h] -p PAN_OR_PROFILE_DB
Parameters
optional arguments:
-p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database (and even genes
database in appropriate contexts).
Show all misc data keys in all misc data tables
Usage
anvi-show-misc-data [-h] [-p PAN_OR_PROFILE_DB] [-c CONTIGS_DB]
[-t NAME] [-D NAME]
Parameters
Database input: Provide 1 of these
-p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database (and even genes
database in appropriate contexts).
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
Details: Everything else.
-t NAME, --target-data-table NAME
The target table is the table you are interested in
accessing. Currently it can be 'items','layers', or
'layer_orders'. Please see most up-to-date online
documentation for more information.
-D NAME, --target-data-group NAME
Data group to focus. Anvi'o misc data tables support
associating a set of data keys with a data group. If
you have no idea what this is, then probably you don't
need it, and anvi'o will take care of you. Note: this
flag is IRRELEVANT if you are working with additional
order data tables.
Split an anvi'o pan or profile database into smaller, self-contained pieces. This is usually great when you want to share a subset of an anvi'o project. You give this guy your databases, and a collection id, and it gives you back directories of individual projects for each bin that can be treated as self-contained smaller anvi'o projects. We know you don't read this far into these help menus, but please remember: you will either need to provide a profile & contigs database pair, or a pan & genomes storage pair. The rest will be taken care of. Magic.
Usage
anvi-split [-h] -p PAN_OR_PROFILE_DB [-c CONTIGS_DB]
[-g GENOMES_STORAGE] [--skip-variability-tables]
[--compress-auxiliary-data] [-C COLLECTION_NAME]
[-b BIN_NAME] [-o DIR_PATH] [--list-collections]
[--skip-hierarchical-clustering]
[--enforce-hierarchical-clustering]
[--distance DISTANCE_METRIC] [--linkage LINKAGE_METHOD]
Parameters
DATABASES: You will either provide a PROFILE/CONTIGS or a PAN/GENOMES STORAGE pair here.
-p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database (and even genes
database in appropriate contexts).
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
Anvi'o genomes storage file
PROFILE/CONTIGS OPTIONS: Some options that are specific to this only.
--skip-variability-tables
Processing variability tables in profile database
might take a very long time. With this flag you will
be asking anvi'o to skip them.
--compress-auxiliary-data
When declared, the auxiliary data file in the
resulting output will be compressed. This saves space,
but it takes a long time. Also, if you are planning to
compress the entire output later using GZIP, there is
little point in doing it here. But you are the boss!
COLLECTION: You should provide a valid collection name. If you do not provide bin names, the program will generate an output for each bin in your collection separately.
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.
OUTPUT: Where do we want the resulting split profiles to be stored.
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files
EXTRAS: Stuff that you rarely need, but you really really need when the time comes. The following parameters will apply to each of the resulting anvi'o profiles that will be split from the mother anvi'o profile.
--list-collections Show available collections and exit.
--skip-hierarchical-clustering
If you are not planning to use the interactive
interface (or if you have other means to add a tree of
contigs in the database) you may skip the step where
hierarchical clustering of your items is performed
based on default clustering recipes matching to your
database type.
--enforce-hierarchical-clustering
If you have more than 25,000 splits in your merged
profile, anvi-merge will automatically skip the
hierarchical clustering of splits (by setting --skip-
hierarchical-clustering flag on). This is due to the
fact that the computational time required for hierarchical
clustering grows much faster than linearly with the number
of items being clustered. Based on our experience we
decided that 25,000 splits is about the maximum we
should try. However, this is not a theoretical limit,
and you can overwrite this heuristic by using this
flag, which would tell anvi'o to attempt to cluster
splits regardless.
--distance DISTANCE_METRIC
The distance metric for the hierarchical clustering.
If you do not use this flag, the default distance
metric will be used for each clustering configuration
which is "euclidean".
--linkage LINKAGE_METHOD
The same story with the `--distance`, except, the
system default for this one is ward.
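To give a feel for why clustering cost climbs so quickly with the number of splits, the sketch below counts the pairwise distances that a distance-based hierarchical clustering has to consider. It assumes a condensed distance matrix of 8-byte floats (the layout SciPy's pdist produces); whether anvi'o stores distances exactly this way is an assumption.

```python
def pairwise_distance_count(n_items):
    """Number of unordered item pairs, i.e. entries in a condensed matrix."""
    return n_items * (n_items - 1) // 2


def condensed_matrix_bytes(n_items, bytes_per_value=8):
    """Memory for the condensed matrix, assuming 8-byte floats."""
    return pairwise_distance_count(n_items) * bytes_per_value


if __name__ == "__main__":
    for n in (1_000, 25_000, 100_000):
        gib = condensed_matrix_bytes(n) / 2**30
        print(f"{n:>7} items: {pairwise_distance_count(n):>13,} distances, ~{gib:.2f} GiB")
```

At 25,000 splits there are already 312,487,500 pairwise distances, which is about 2.5 GB of 8-byte values before any clustering work starts.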
Summarizer for anvi'o pan or profile db's. Essentially, this program takes a collection id along with either a profile database and a contigs database or a pan database and a genomes storage and generates a static HTML output for what is described in a given collection. The output directory will contain almost everything any downstream analysis may need, and can be displayed using a browser without the need for an anvi'o installation. For this reason alone, reporting summary outputs as supplementary data with publications is a great idea for transparency and reproducibility.
Usage
anvi-summarize [-h] -p PAN_OR_PROFILE_DB [-c CONTIGS_DB]
[-g GENOMES_STORAGE] [--init-gene-coverages]
[--reformat-contig-names]
[--report-aa-seqs-for-gene-calls]
[--report-DNA-sequences] [-C COLLECTION_NAME]
[-o DIR_PATH] [--list-collections]
[--taxonomic-level {t_domain,t_phylum,t_class,t_order,t_family,t_genus,t_species}]
[--cog-data-dir COG_DATA_DIR] [--quick-summary]
[--just-do-it]
Parameters
PROFILE: The profile. It could be a standard or pan profile database.
-p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database (and even genes
database in appropriate contexts).
PROFILE TYPE SPECIFIC PARAMETERS: If you are summarizing a collection stored in a standard anvi'o profile, you will need a contigs database to go with it. If you are working with a pan profile, then you will need to provide a genomes storage. Don't worry too much, because anvi'o will warn you gently if you make a mistake.
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
Anvi'o genomes storage file
STANDARD PROFILE SPECIFIC PARAMS: Parameters that are only relevant to standard profile summaries (declaring or not declaring them will not change anything if you are summarizing a pan profile).
--init-gene-coverages
Initialize gene coverage and detection data. This is a
very computationally expensive step, but it is
necessary when you need gene level coverage data. The
reason this is very computationally expensive is
because anvi'o computes gene coverages by going back
to actual coverage values of each gene to average
them, instead of using contig average coverage values,
for extreme accuracy.
--reformat-contig-names
Reformat contig names while generating the summary
output so they look fancy. With this flag, anvi'o will
replace the original names of contigs with those that
include the bin name as a prefix in resulting summary
output files per bin. Use this flag carefully as it
may influence your downstream analyses due to the fact
that your original contig names in your input FASTA
file for the contigs database will not be in the
summary output. Although, anvi'o will report a
conversion map per bin so you can recover the original
contig name if you have to.
--report-aa-seqs-for-gene-calls
You can use this flag if you would like amino acid AND
dna sequences for your gene calls in the genes output
file. By default, only dna sequences are reported.
PAN PROFILE SPECIFIC PARAMS: Parameters that are only relevant to pan profile summaries (declaring or not declaring them will not change anything if you are summarizing a standard profile).
--report-DNA-sequences
By default, this program reports amino acid sequences.
Use this flag to report DNA sequences instead. Also
note, since gene clusters are aligned via amino acid
sequences, using this flag removes alignment
information manifesting in the form of gap characters
(`-` characters) that would be present if amino acid
sequences were reported. Read the warnings during
runtime for more detailed information.
COMMONS: Common parameters for both pan and standard profile summaries.
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files
--list-collections Show available collections and exit.
--taxonomic-level {t_domain,t_phylum,t_class,t_order,t_family,t_genus,t_species}
The taxonomic level to use whenever relevant and/or
available. The default taxonomic level is t_genus, but
if you choose something specific, anvi'o will focus on
that whenever possible.
--cog-data-dir COG_DATA_DIR
The directory path for your COG setup. Anvi'o will try
to use the default path if you do not specify
anything.
--quick-summary When declared the summary output will be generated as
quickly as possible, with minimum amount of essential
information about bins.
EXTRA: Extra stuff because you're extra.
--just-do-it Don't bother me with questions or warnings, just do
it.
Update the description in an anvi'o database
Usage
anvi-update-db-description [-h] --description TEXT_FILE DB
Parameters
positional arguments:
DB An anvi'o database.
optional arguments:
--description TEXT_FILE
A plain text file that contains some description about
the project. You can use Markdown syntax. The
description text will be rendered and shown in all
relevant interfaces, including the anvi'o interactive
interface, or anvi'o summary outputs.
Add or re-run genes from an already existing structure database. All settings used to generate your database will be used in this program.
Usage
anvi-update-structure-database [-h] -c CONTIGS_DB -s STRUCTURE_DB
[--genes-of-interest FILE]
[--gene-caller-ids GENE_CALLER_IDS]
[--dump-dir DUMP_DIR]
[--list-modeller-params] [--rerun-genes]
[--modeller-executable MODELLER_EXECUTABLE]
[-T NUM_THREADS]
[--write-buffer-size-per-thread INT]
Parameters
DATABASES: Declaring relevant anvi'o databases. First things first.
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-s STRUCTURE_DB, --structure-db STRUCTURE_DB
Anvi'o structure database.
GENES: Specify which genes you want to be modelled. If a gene already exists in the DB, it will be overwritten if --overwrite is set. Otherwise, an error will be raised.
--genes-of-interest FILE
A file with anvi'o gene caller IDs. There should be
only one column in the file, and each line should
correspond to a unique gene caller id (without a
column header).
--gene-caller-ids GENE_CALLER_IDS
Gene caller ids. Multiple of them can be declared
separated by a delimiter (the default is a comma). In
anvi-gen-variability-profile, if you declare nothing
you will get all genes matching your other filtering
criteria. In other programs, you may get everything,
nothing, or an error. It really depends on the
situation. Fortunately, mistakes are cheap, so it's
worth a try.
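The two ways of declaring genes described above can be sketched in Python as follows. This is only an illustration of the input formats, not anvi'o's own parser: the function names are hypothetical, and the sketch assumes gene caller IDs are integers.

```python
def parse_gene_caller_ids(text, delimiter=","):
    """Split a delimited string such as '1,5,12' into integer gene caller IDs."""
    return [int(item) for item in text.split(delimiter) if item.strip()]


def read_genes_of_interest(lines):
    """Read a single-column listing of gene caller IDs (no header, no duplicates).

    `lines` is any iterable of strings, e.g. an open file handle.
    Assumes IDs are integers, so a column header would raise ValueError.
    """
    gene_ids = []
    seen = set()
    for line_number, line in enumerate(lines, start=1):
        value = line.strip()
        if not value:
            continue  # tolerate blank lines
        if len(value.split()) != 1:
            raise ValueError(f"line {line_number}: expected a single column")
        gene_id = int(value)
        if gene_id in seen:
            raise ValueError(f"line {line_number}: duplicate gene caller id {gene_id}")
        seen.add(gene_id)
        gene_ids.append(gene_id)
    return gene_ids
```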
OUTPUT: Output file and output style.
--dump-dir DUMP_DIR Modeling and annotating structures requires a lot of
moving parts, each of which has its own outputs. The
output of this program is a structure database
containing the pertinent results of this computation,
however a lot of stuff doesn't make the cut. By
providing a directory for this parameter you will get,
in addition to the structure database, a directory
containing the raw output for everything.
MODELLER PARAMS: Parameters for MODELLER's homology modeling.
--list-modeller-params
Since you are updating an existing DB, modeller params
are set in place. You can have this program list them
by providing this flag
EXTRA: Everything else.
--rerun-genes Supply if you would like to rerun structural modelling
for your genes of interest if they are already present
in your DB
--modeller-executable MODELLER_EXECUTABLE
The MODELLER program to use. For example, `mod9.19`.
Anvi'o will try and find it if not provided.
-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on a SGE
--if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.
--write-buffer-size-per-thread INT
How many items should be kept in memory before they
are written to the disk. The default is 25 per thread.
So a single-threaded job would have a write buffer
size of 25, whereas a job with 4 threads would have a
write buffer size of 4*25. The larger the buffer size,
the less frequently the program will access the disk,
yet the more memory will be consumed since the
processed items will be cleared off the memory only
after they are written to the disk. The default buffer
size will likely work for most cases. Please keep an
eye on the memory usage output to make sure the memory
use never exceeds the size of the physical memory. If
--num-threads is 1, this parameter is ignored because
the DB is written to after each gene
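The arithmetic behind the write buffer can be sketched in a few lines of Python. The function names are hypothetical and this is not taken from the anvi'o codebase. It only restates the rule above: the buffer scales with the number of threads, and it is not used at all in single-threaded runs.

```python
def write_buffer_size(num_threads, per_thread=25):
    """Total number of items kept in memory before they are written to disk."""
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    return num_threads * per_thread


def buffer_is_used(num_threads):
    """With a single thread the DB is written to after each gene, so no buffering."""
    return num_threads > 1
```

For example, `write_buffer_size(4)` gives 100, matching the 4*25 case in the help text.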
Download and install minor releases of anvi'o from a Github repository.
Usage
anvi-upgrade [-h] [--repository REPOSITORY]
Parameters
optional arguments:
--repository REPOSITORY
Source repository to download releases, currently only
Github is supported. Enter in 'merenlab/anvio' format.
A script to add a 'DEFAULT' collection in an anvi'o pan or profile database with a bin named 'EVERYTHING' that describes all items available in the profile database.
Usage
anvi-script-add-default-collection [-h] -p PAN_OR_PROFILE_DB
[-c CONTIGS_DB] [-b BIN_NAME]
[-C COLLECTION_NAME]
Parameters
optional arguments:
-p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database (and even genes
database in appropriate contexts).
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-b BIN_NAME, --bin-id BIN_NAME
Name for the new bin. If you don't provide any then it
will be named "EVERYTHING".
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Name for the new collection. If you don't provide any
then it will be named "DEFAULT".
Takes in gene calls by AUGUSTUS v3.3.3, generates an anvi'o external gene calls file. It may work well with other versions of AUGUSTUS, too. It is just that no one has tested the script with different versions of the program.
Usage
anvi-script-augustus-output-to-external-gene-calls [-h] -i INPUT_FILE
[-o FILE_PATH]
[--just-do-it]
Parameters
optional arguments:
-i INPUT_FILE, --input-file INPUT_FILE
Gene calls file from AUGUSTUS (that ends with .gff).
Please note that the script is only tested with
AUGUSTUS v3.3.3 output (although it may still work
with other versions of the program).
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
--just-do-it Don't bother me with questions or warnings, just do
it.
This program calculates for each gene the ratio of pN/pS (the metagenomic analog of dN/dS; see doi:10.1038/nature11711 and doi:10.7717/peerj.2959) based on metagenomic read recruitment. Unlike standard pN/pS calculations, however, it relies on codons rather than nucleotides for accurate estimations of synonymity.
Usage
anvi-script-calculate-pn-ps-ratio [-h] [-a SAAV_FILE] [-b SCV_FILE] -c
CONTIGS_DB [-j FLOAT]
[-i MINIMUM_NUM_VARIANTS]
[-m MIN_COVERAGE] -o DIR_PATH
Parameters
VARIABILITY: You provide two variability tables generated with anvi-gen-variability-profile: one for SAAVs (generated with --engine AA) and one for SCVs (generated with --engine CDN). They must be generated with the same profile and contigs database pair. To be safe, we recommend you use the same settings during both commands except for changing --engine AA to --engine CDN and the output filename.
-a SAAV_FILE, --saav-table SAAV_FILE
Filepath to the SAAV table.
-b SCV_FILE, --scv-table SCV_FILE
Filepath to the SCV table.
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Filepath to the contigs database used to generate
variability tables.
TUNABLES: Successfully tune one or more of these parameters to unlock the badge 'Advanced anvian'.
-j FLOAT, --min-departure-from-consensus FLOAT
Variants (either SCVs or SAAVs) will be ignored if
they have a departure from consensus less than this
value. Note: Keep in mind you may have already
supplied this parameter during anvi-gen-variability-
profile. The default value is "0.10".
-i MINIMUM_NUM_VARIANTS, --minimum-num-variants MINIMUM_NUM_VARIANTS
Ignore genes with fewer than this number of single
codon variants. This avoids being impressed by infinite
pN/pS values, when in reality the gene had a
single SAAV and no synonymous SCVs. The default is 4
to ensure a default value with some level of
statistical importance.
-m MIN_COVERAGE, --min-coverage MIN_COVERAGE
If the coverage value at a codon is less than this
amount, any SAAVs or SCVs associated with it will be
ignored. The default is 30.
OUTPUT: The output of this program is a folder directory with several tables.
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files
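The three tunables above act as simple filters applied before any ratio is computed. Here is a minimal Python sketch of that filtering step using the stated defaults (0.10, 4, and 30). The `Variant` record and function names are hypothetical, and the pN/pS calculation itself is deliberately left out.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Variant:
    gene_id: int
    departure_from_consensus: float
    coverage: int


def filter_variants(variants, min_departure_from_consensus=0.10, min_coverage=30):
    """Drop SAAVs or SCVs below the departure-from-consensus or coverage cutoffs."""
    return [
        v for v in variants
        if v.departure_from_consensus >= min_departure_from_consensus
        and v.coverage >= min_coverage
    ]


def genes_with_enough_variants(scvs, minimum_num_variants=4):
    """Return gene IDs that have at least `minimum_num_variants` SCVs."""
    counts = Counter(v.gene_id for v in scvs)
    return {gene_id for gene_id, n in counts.items() if n >= minimum_num_variants}
```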
Reformat FASTA file (remove contigs based on length, or based on given lists of deflines to include/exclude, and/or generate an output with simpler names)
Usage
anvi-script-checkm-tree-to-interactive [-h] -t CHECKM TREE -o DIRECTORY
Parameters
optional arguments:
-t CHECKM TREE, --tree CHECKM TREE
Tree file generated by CheckM.
-o DIRECTORY, --output-dir DIRECTORY
The directory name that output files will be stored.
Run ANI between contigs in a single FASTA file.
Usage
anvi-script-compute-ani-for-fasta [-h] -f FASTA -o DIR_PATH [-p PAN_DB]
[-T NUM_THREADS]
[--log-file FILE_PATH]
[--method {ANIm,ANIb,ANIblastall,TETRA}]
[--distance DISTANCE_METRIC]
[--linkage LINKAGE_METHOD]
[--just-do-it]
Parameters
optional arguments:
-f FASTA, --fasta-file FASTA
A FASTA-formatted input file
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files
-p PAN_DB, --pan-db PAN_DB
Anvi'o pan database
-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on a SGE
--if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.
--log-file FILE_PATH File path to store debug/output messages.
--method {ANIm,ANIb,ANIblastall,TETRA}
Method for pyANI. The default is ANIb. You must have
the necessary binary in path for whichever method you
choose. According to the pyANI help for v0.2.7 at
https://github.com/widdowquinn/pyani, the method
'ANIm' uses MUMmer (NUCmer) to align the input
sequences. 'ANIb' uses BLASTN+ to align 1020nt
fragments of the input sequences. 'ANIblastall' uses
the legacy BLASTN to align 1020nt fragments. Finally,
'TETRA' calculates tetranucleotide frequencies of
each input sequence.
--distance DISTANCE_METRIC
The distance metric for the hierarchical clustering.
The default is "euclidean".
--linkage LINKAGE_METHOD
The linkage method for the hierarchical clustering.
The default is "ward".
--just-do-it Don't bother me with questions or warnings, just do
it.
A program to estimate the size of the actual population genome to which a MAG belongs.
Usage
anvi-script-estimate-genome-size [-h] -c CONTIGS_DB [--verbose]
Parameters
MANDATORY INPUT: An anvi'o contigs database that hopefully contains a MAG.
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
PARAMETERS OF CONVENIENCE: Because life is already very hard as it is.
--verbose Be verbose, print more messages whenever possible.
Filter FASTA file according to BLAST table (remove sequences with bad BLAST alignment).
Usage
anvi-script-filter-fasta-by-blast [-h] [-f FASTA] [-o FILE_PATH] -b TAB
DELIMITED FILE -s OUTFMT -t THRESHOLD
[--just-do-it]
Parameters
optional arguments:
-f FASTA, --fasta-file FASTA
A FASTA-formatted input file
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
-b TAB DELIMITED FILE, --blast-output TAB DELIMITED FILE
BLAST table generated with blastp. `--outfmt 6` as the
output format is assumed.
-s OUTFMT, --outfmt OUTFMT
Specify the column ordering of your BLAST report. We
add the following parameter to our BLAST searches so
the output report contains the `qlen` field, which is
not included by default: `-outfmt '6 qseqid sseqid
pident length mismatch gapopen qstart qend sstart send
evalue bitscore qlen slen'`. You may have used a
different `-outfmt` parameter, and you should use this
parameter to explicitly define the column names in
your output file. For instance, if you had used the
parameter mentioned above, then the correct version of
this parameter would be: "qseqid sseqid pident length
mismatch gapopen qstart qend sstart send evalue
bitscore qlen slen". Regardless of the BLAST output
format, your columns MUST contain the following
parameters for this program to work properly:
'qseqid', 'bitscore', 'length', 'qlen', and 'pident'.
-t THRESHOLD, --threshold THRESHOLD
What `proper_pident` threshold do you want to use for
filtering out sequences whose top bit-score matches
have `proper_pident`s less than this threshold? We
have defined `proper_pident` to be the percentage of
the query amino acids that both aligned to and were
identical to the corresponding matched amino acid.
Note that the `pident` parameter output by BLAST does
not include regions of the query sequence unaligned to
the matched sequence, whereas `proper_pident` does.
For example, a sequence that's only half aligned by a
match but with 100% identity at matched regions has a
`pident` of 100 but a `proper_pident` of 50. The
default is 30.0%.
--just-do-it Don't bother me with questions or warnings, just do
it.
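To make the difference between `pident` and `proper_pident` concrete, here is a small Python sketch. It assumes `proper_pident` can be approximated as `pident * length / qlen`, which reproduces the half-aligned example above. The program's exact formula may treat gaps differently, and the function names are hypothetical.

```python
REQUIRED_COLUMNS = {"qseqid", "bitscore", "length", "qlen", "pident"}


def parse_blast_line(line, outfmt):
    """Map one tab-delimited BLAST line onto the column names given in `outfmt`."""
    columns = outfmt.split()
    missing = REQUIRED_COLUMNS - set(columns)
    if missing:
        raise ValueError(f"outfmt is missing required columns: {sorted(missing)}")
    return dict(zip(columns, line.rstrip("\n").split("\t")))


def proper_pident(pident, length, qlen):
    """Percent of ALL query residues that were aligned and identical (assumed formula)."""
    return pident * length / qlen


outfmt = ("qseqid sseqid pident length mismatch gapopen qstart qend "
          "sstart send evalue bitscore qlen slen")
line = "q1\ts1\t100.0\t50\t0\t0\t1\t50\t1\t50\t1e-20\t100\t100\t60"
hit = parse_blast_line(line, outfmt)
score = proper_pident(float(hit["pident"]), int(hit["length"]), int(hit["qlen"]))
```

Here a query of 100 residues is aligned over 50 of them with 100% identity, so `score` comes out as 50.0 even though `pident` is 100.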
Train a classifier for CPR prediction
Usage
anvi-script-gen-CPR-classifier [-h] [-o CLASSIFIER_FILE] MATRIX_FILE
Parameters
positional arguments:
MATRIX_FILE TAB-delimited matrix of CPR genome names, classes, and
presence absence of single-copy genes. Headers of the
first two rows should be "genome", and "class". The
rest of the rows should be single-copy genes.
optional arguments:
-o CLASSIFIER_FILE, --output CLASSIFIER_FILE
Output file name for the classifier.
Quantify the detection of genes in genomes in metagenomes to identify the environmental core. This is a helper script for anvi'o metapangenomic workflow.
Usage
anvi-script-gen-distribution-of-genes-in-a-bin [-h] -c CONTIGS_DB
[-p PROFILE_DB]
[-C COLLECTION_NAME]
[-b BIN_NAME]
[--min-detection FLOAT]
[--fraction-of-median-coverage FLOAT]
Parameters
optional arguments:
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.
--min-detection FLOAT
For this entire thing to work, the genome you are
focusing on should be detected in at least one
metagenome. If that is not the case, it would mean
that you do not have any sample that represents the
niche for this organism (or you do not have enough
depth of coverage) to investigate the detection of
genes in the environment. By default, this script
requires at least '0.5' of the genome to be detected
in at least one metagenome. This parameter allows you
to change that. 0 would mean no detection test
required, 1 would mean the entire genome must be
detected.
--fraction-of-median-coverage FLOAT
The value set here will be used to remove a gene if
its total coverage across environments is less than
the median coverage of all genes multiplied by this
value. The default is 0.25, which means, if the median
total coverage of all genes across all samples is
100X, then, a gene with a total coverage of less than
25X across all samples will be assumed not a part of
the 'environmental core'.
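The `--fraction-of-median-coverage` rule can be illustrated with a short Python sketch. It assumes a plain dictionary of total coverage per gene and a hypothetical function name. It mirrors the description above rather than the script's actual code.

```python
from statistics import median


def environmental_core(total_coverage_per_gene, fraction_of_median_coverage=0.25):
    """Keep genes whose total coverage is not below median * fraction."""
    cutoff = median(total_coverage_per_gene.values()) * fraction_of_median_coverage
    return {
        gene for gene, coverage in total_coverage_per_gene.items()
        if coverage >= cutoff
    }


coverages = {"g1": 100, "g2": 20, "g3": 150, "g4": 25, "g5": 100}
core = environmental_core(coverages)
```

With a median of 100X and the default fraction of 0.25, the cutoff is 25X, so `g2` (20X) is dropped and every other gene stays.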
Generate a static web page for anvi'o help pages
Usage
anvi-script-gen-help-pages [-h] [-o DIR_PATH]
Parameters
optional arguments:
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files
A simple script to generate a TAB-delimited file that reports the frequency of HMM hits for a given HMM source across contigs databases.
Usage
anvi-script-gen-hmm-hits-matrix-across-genomes [-h] [-e FILE_PATH]
[-i FILE_PATH]
[--hmm-source SOURCE NAME]
[-l] -o FILE_PATH
Parameters
INPUT: INTERNAL/EXTERNAL GENOMES FILE: Yes. You need to use an internal and/or external genomes file to tell anvi'o where your contigs databases are.
-e FILE_PATH, --external-genomes FILE_PATH
A two-column TAB-delimited flat text file that lists
anvi'o contigs databases. The first item in the header
line should read 'name', and the second should read
'contigs_db_path'. Each line in the file should
describe a single entry, where the first column is the
name of the genome (or MAG), and the second column is
the anvi'o contigs database generated for this genome.
-i FILE_PATH, --internal-genomes FILE_PATH
A five-column TAB-delimited flat text file. The header
line must contain these columns: 'name', 'bin_id',
'collection_id', 'profile_db_path', 'contigs_db_path'.
Each line should list a single entry, where 'name' can
be any name to describe the anvi'o bin identified as
'bin_id' that is stored in a collection.
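If you have many contigs databases, writing the external genomes file by hand gets old quickly. Below is a minimal Python sketch that writes the two-column format described above. The function name and the example genome names and paths are made up for illustration.

```python
import csv
import io


def write_external_genomes(genomes, handle):
    """Write (name, contigs_db_path) pairs as a TAB-delimited external genomes file."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["name", "contigs_db_path"])
    for name, contigs_db_path in genomes:
        writer.writerow([name, contigs_db_path])


# Hypothetical genomes; replace with your own names and paths.
buffer = io.StringIO()
write_external_genomes([("MAG_01", "MAG_01.db"), ("MAG_02", "MAG_02.db")], buffer)
external_genomes_txt = buffer.getvalue()
```

To write to disk instead, pass a handle from `open("external-genomes.txt", "w", newline="")`.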
HMM STUFF: This is where you can specify an HMM source, and/or a list of genes to filter your results.
--hmm-source SOURCE NAME
Use a specific HMM source. You can use '--list-hmm-
sources' flag to see a list of available resources.
The default is 'None'.
-l, --list-hmm-sources
List available HMM sources in the contigs database and
quit.
OUTPUTTAH:
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
Generate a network of anvi'o programs
Usage
anvi-script-gen-programs-network [-h] [-o FILE_PATH]
[-p PROGRAM_NAMES_TO_FOCUS]
Parameters
optional arguments:
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
-p PROGRAM_NAMES_TO_FOCUS, --program-names-to-focus PROGRAM_NAMES_TO_FOCUS
Comma-separated list of program names to focus on. Mostly
for debugging purposes.
Generate a markdown summary (vignette) of anvi'o programs
Usage
anvi-script-gen-programs-vignette [-h] [-o FILE_PATH]
[-p PROGRAM_NAMES_TO_FOCUS]
Parameters
optional arguments:
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
-p PROGRAM_NAMES_TO_FOCUS, --program-names-to-focus PROGRAM_NAMES_TO_FOCUS
Comma-separated list of program names to focus on. Mostly
for debugging purposes.
Take a FASTQ file and convert it into 2 FASTQ files. Each read from the original FASTQ file is halved: one half is put in the R1 FASTQ and the other half is reverse complemented and put in the R2 FASTQ. If you've ended up here, things have clearly not gone very well for you, and as I write this sentence, I wholeheartedly sympathize.
Usage
anvi-script-gen-pseudo-paired-reads-from-fastq [-h] -f FASTQ
[-O FILENAME_PREFIX]
Parameters
optional arguments:
-f FASTQ, --fastq FASTQ
-O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
If you want final FASTQs with the format
myfastq_1.fastq and myfastq_2.fastq, then this
parameter should be set to myfastq
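The halving described above can be sketched for a single read in Python. This is an illustration, not the script's implementation. It assumes odd-length reads give the extra base to the second half, and that quality scores for R2 are reversed along with the sequence.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def split_read(sequence, qualities):
    """Split one read into pseudo-paired (sequence, qualities) tuples for R1 and R2."""
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    midpoint = len(sequence) // 2
    r1 = (sequence[:midpoint], qualities[:midpoint])
    # Reverse complement the second half; qualities are reversed to stay aligned.
    r2 = (sequence[midpoint:].translate(_COMPLEMENT)[::-1], qualities[midpoint:][::-1])
    return r1, r2


r1, r2 = split_read("ACGTTTGA", "ABCDEFGH")
```

Here `r1` is `("ACGT", "ABCD")` and `r2` is `("TCAA", "HGFE")`.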
Train a classifier for SCG domain prediction
Usage
anvi-script-gen-scg-domain-classifier [-h] [--genomes-dir GENOMES_DIR]
[-o PATH]
Parameters
optional arguments:
--genomes-dir GENOMES_DIR
This should be a directory that contains a directory
per domain for single-copy core gene collections a
given version of anvi'o knows about. For instance, if
there are collections for archaea, bacteria, and
eukarya, then this directory should contain
subdirectories with these names. Contents of which
should be contigs databases that belong to those
domains. These genomes will be used to generate the
classifier.
-o PATH, --output PATH
Output file name for the classifier.
Generate short reads from contigs
Usage
anvi-script-gen-short-reads [-h] [--output-file-path FASTA_FILE]
CONFIG_FILE
Parameters
positional arguments:
CONFIG_FILE Configuration file
optional arguments:
--output-file-path FASTA_FILE
Output FASTA file path
A simple script to generate info from search tables
Usage
anvi-script-gen_stats_for_single_copy_genes.py [-h] [--list-sources]
[--source SOURCE]
CONTIGS_DB
Parameters
positional arguments:
CONTIGS_DB Contigs database to read from.
optional arguments:
--list-sources Show available single-copy gene search results and exit.
--source SOURCE Source to focus on. If none declared, all single-copy gene
sources are going to be listed.
Get nucleotide-level, contig-level, or bin-level coverage values from a BAM file
Usage
anvi-script-get-coverage-from-bam [-h] -b BAM_FILE [-c CONTIG_NAME]
[-l CONTIGS_OF_INTEREST]
[-C COLLECTION_TXT] -m
{pos,contig,bin} -o OUTPUT
[--skip-contigs-check]
Parameters
REQUIRED: Declare your BAM file here
-b BAM_FILE, --bam-file BAM_FILE
Sorted and indexed BAM file to analyze.
OPTION #1: This is the first and simplest option. Provide a contig name
-c CONTIG_NAME, --contig-name CONTIG_NAME
The name of a single contig
OPTION #2: Use this to characterize coverage for a list of contigs
-l CONTIGS_OF_INTEREST, --contigs-of-interest CONTIGS_OF_INTEREST
Provide here a file where each line is a contig name.
OPTION #3: Use this to characterize coverage for a collection of contig sets (bins)
-C COLLECTION_TXT, --collection-txt COLLECTION_TXT
Provide a collection text file. The first column
should be contig names and the second column should be
the bin to which the contig belongs. If you have a
collection from a profile database, you can export it
in this format with anvi-export-collection.
METHOD: Do you want to report coverage at a nucleotide level? Contig averages? Bin averages? Pick the method here.
-m {pos,contig,bin}, --method {pos,contig,bin}
If pos, each nucleotide position will be reported
(valid for OPTION #1, #2, #3). If contig, report
contains contig averages (valid for OPTION #2, #3). If
bin, report contains bin averages (valid for OPTION
#3).
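Which `--method` values go with which input option is easy to get wrong, so here is the rule above written as a small Python lookup. The option keys and function name are hypothetical labels for OPTION #1, #2, and #3.

```python
# Methods each input option supports, as listed in the help text above.
VALID_METHODS = {
    "contig_name": {"pos"},                        # OPTION #1
    "contigs_of_interest": {"pos", "contig"},      # OPTION #2
    "collection_txt": {"pos", "contig", "bin"},    # OPTION #3
}


def is_valid_method(input_option, method):
    """Return True if `method` can be used with the given input option."""
    if input_option not in VALID_METHODS:
        raise ValueError(f"unknown input option: {input_option}")
    return method in VALID_METHODS[input_option]
```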
OUTPUT: Your output file is decided here. Keep in mind if you use --method pos, this file will contain as many lines as there are nucleotides defined by your input option.
-o OUTPUT, --output OUTPUT
Output tab-delimited file path. Will overwrite
existing files.
EXTRAS: All the misfits
--skip-contigs-check Checking to see that your collection text or contigs
of interest file has correct names can take a really
long time if you have a large enough number of
contigs. Use this flag to forego checking, and find
out the hard way.
A simple script to generate a TAB-delimited file of gene caller IDs and their HMM hits for a given HMM source.
Usage
anvi-script-get-hmm-hits-per-gene-call [-h] -c CONTIGS_DB
[--hmm-source SOURCE NAME] -o
FILE_PATH
Parameters
INPUT: ANVI'O CONTIGS DB:
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
INPUT: HMM SOURCE:
--hmm-source SOURCE NAME
Use a specific HMM source. You can use '--list-hmm-
sources' flag to see a list of available resources.
The default is 'None'.
OUTPUTTAH:
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
You give this program one or more FASTQ files and a short sequence, and it returns all short reads from the FASTQ files that match it. The purpose of this is to get back short reads that may be extending into hypervariable regions of genomes, which results in decreased mappability of short reads in the metagenome given a reference. You often see those areas of genomes as significant dips in coverage, and in most cases with a large number of SNVs. When you provide the downstream conserved sequence, this program allows you to take a better look at those regions at the short read level without any mapping.
Usage
anvi-script-get-short-reads-matching-something [-h] --match-sequence
SHORT SEQUENCE [-m INT]
-s NAME [-O PATH]
[--report-raw]
[--stop-after INT]
FASTQ_FILES
[FASTQ_FILES ...]
Parameters
positional arguments:
FASTQ_FILES One or more FASTQ formatted files
optional arguments:
--match-sequence SHORT SEQUENCE
Short sequence to look for.
-m INT, --min-remainder-length INT
Minimum length of the remainder of the read after the
match. If your short read is XXXMMMMMMYYYYYYYYYYYYYY,
where M indicates nucleotides of matching sequence,
min remainder length is len(Y). Default is 60.
-s NAME, --sample-name NAME
A short sample name (use a single word without spaces
or fancy chars)
-O PATH, --output-directory PATH
Output directory for results to be stored. The default
is the current working directory.
--report-raw Just report them raw. Don't bother trimming.
--stop-after INT Stop after X number of hits because who needs data.
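The idea behind `--min-remainder-length` can be sketched in Python as below. The function name is hypothetical, and the sketch assumes only the first occurrence of the match sequence in a read is considered. How the program itself trims and reports reads may differ.

```python
def remainder_after_match(read, match_sequence, min_remainder_length=60):
    """Return the part of `read` after the match, or None if the read is rejected.

    For a read shaped like XXXMMMMMMYYYY this returns the Y portion,
    provided len(Y) is at least `min_remainder_length`.
    """
    start = read.find(match_sequence)
    if start == -1:
        return None  # the read does not contain the match sequence
    remainder = read[start + len(match_sequence):]
    if len(remainder) < min_remainder_length:
        return None  # not enough sequence left after the match
    return remainder


read = "TTT" + "ACGTAC" + "G" * 14
```

With this toy read, `remainder_after_match(read, "ACGTAC", min_remainder_length=10)` returns the 14 trailing bases, while the default of 60 rejects the read.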
Generate an additional data file from multiple collections.
Usage
anvi-script-merge-collections [-h] -c CONTIGS_DB -i FILE(S) [FILE(S) ...]
-o OUTPUT_FILE
Parameters
optional arguments:
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-i FILE(S) [FILE(S) ...], --input-files FILE(S) [FILE(S) ...]
Input file(s). TAB-delimited input files should have
two columns, where the first column holds the contig
name, and the second one the bin id. This is the
standard output of the program anvi-export-collection.
-o OUTPUT_FILE, --output-file OUTPUT_FILE
Output file name.
Screen for genomes to find likely members of CPR
Usage
anvi-script-predict-CPR-genomes [-h] -c CONTIGS_DB [-p PROFILE_DB]
[-C COLLECTION_NAME]
[--list-collections]
[--report-only-cpr]
[--min-genome-size MIN_GENOME_SIZE]
[--min-percent-completion MIN_PERCENT_COMPLETION]
[--max-percent-redundancy MAX_PERCENT_REDUNDANCY]
[--min-class-probability MIN_CLASS_PROBABILITY]
[-o FILE_PATH] [--just-do-it]
CLASSIFIER_FILE
Parameters
positional arguments:
CLASSIFIER_FILE Model output generated by anvi-script-gen-CPR-
classifier
optional arguments:
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
--list-collections Show available collections and exit.
--report-only-cpr Include only bins that look like CPR genomes.
--min-genome-size MIN_GENOME_SIZE
Minimum genome size to consider for CPR in Mbp.
Default is 0.500000
--min-percent-completion MIN_PERCENT_COMPLETION
Minimum percent completion estimate based on anvi'o
default single-copy gene collections. Default is 50
--max-percent-redundancy MAX_PERCENT_REDUNDANCY
Maximum percent redundancy of single-copy genes in an
anvi'o bin, or a genome to consider for
classification. The default is 30
--min-class-probability MIN_CLASS_PROBABILITY
If the classification confidence is below this don't
bother. Default is 75.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
--just-do-it Don't bother me with questions or warnings, just do
it.
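The four thresholds above can be read as one combined check per genome or bin. The Python sketch below uses the stated defaults. The function name is hypothetical, and how the script actually orders or reports these checks is an assumption.

```python
def passes_cpr_filters(
    genome_size_mbp,
    percent_completion,
    percent_redundancy,
    class_probability,
    min_genome_size=0.5,
    min_percent_completion=50,
    max_percent_redundancy=30,
    min_class_probability=75,
):
    """Return True if a genome clears every threshold for CPR classification."""
    return (
        genome_size_mbp >= min_genome_size
        and percent_completion >= min_percent_completion
        and percent_redundancy <= max_percent_redundancy
        and class_probability >= min_class_probability
    )
```

For instance, a 0.8 Mbp bin at 70% completion, 5% redundancy, and 90% class probability passes, while the same bin at 60% class probability does not.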
This script takes a GenBank file, and outputs a FASTA file, as well as two additional TAB-delimited output files for external gene calls and gene functions that can be used with the programs anvi-gen-contigs-database and anvi-import-functions.
Usage
anvi-script-process-genbank [-h] -i GENBANK [-O FILENAME_PREFIX]
[--output-fasta FASTA]
[--output-gene-calls TAB DELIMITED FILE]
[--output-functions TAB DELIMITED FILE]
[--annotation-source ANNOTATION_SOURCE]
[--annotation-version ANNOTATION_VERSION]
Parameters
INPUT: Give us the preciousss…
-i GENBANK, --input-genbank GENBANK
Input GenBank file
OUTPUT: You either provide a 'prefix', or provide specific output file names/paths. You can't mix the two (well, you can try).
-O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
A prefix to be used while naming the output files (no
file type extensions please; just a prefix).
--output-fasta FASTA Output FASTA file path.
--output-gene-calls TAB DELIMITED FILE
Output file path for external gene calls
--output-functions TAB DELIMITED FILE
Output file path for anvi'o-importable gene functions
file
DETAILS: Setting the annotation source and version data to appear in the output file for functional annotations file.
--annotation-source ANNOTATION_SOURCE
Annotation source (default: "NCBI_PGAP")
--annotation-version ANNOTATION_VERSION
Annotation source version to be stored in the database
(default: "v4.6")
This script takes the 'metadata' output of the program ncbi-genome-download (see https://github.com/kblin/ncbi-genome-download for details), and processes each GenBank file found in the metadata file to generate a FASTA file, as well as genes and functions files for each entry. Plus, it automatically generates a FASTA TXT file descriptor for anvi'o snakemake workflows. So it is a multi-talented program like that.
Usage
anvi-script-process-genbank-metadata [-h] -m GENBANK_METADATA
[-o DIR_PATH]
[--output-fasta-txt OUTPUT_FASTA_TXT]
[-E]
Parameters
INPUT: Give us the preciousss…
-m GENBANK_METADATA, --metadata GENBANK_METADATA
This is the file you get from the program `ncbi-
genome-download` when you use the parameter
`--metadata-table`.
OUTPUT: Where to find your precioussesss…
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files
--output-fasta-txt OUTPUT_FASTA_TXT
This is not a FASTA file, but a TAB-delimited file
with all the file names and paths processed by this
program. This output can directly go into the anvi'o
snakemake workflows because magic.
ADDITIONAL PARAMETERS: Additional things you can set.
-E, --exclude-gene-calls-from-fasta-txt
This flag will exclude the external gene calls and
functions from the fasta.txt file. Files for external
gene calls and functions will still be generated according
to the information stored in the GenBank file, but they
will simply not be included in your fasta.txt file. By
doing so you will *guarantee* that when you use this file
from within a workflow, anvi'o will use its default gene
caller to identify genes.
Reformat FASTA file (remove contigs based on length, or based on a given list of deflines, and/or generate an output with simpler names)
Usage
anvi-script-reformat-fasta [-h] [-l MIN_LENGTH]
[--max-percentage-gaps PERCENTAGE]
[-i TXT FILE] [-I TXT FILE] -o FASTA FILE
[--simplify-names] [--prefix PREFIX]
[-r REPORT FILE]
FASTA FILE
Parameters
positional arguments:
FASTA FILE
optional arguments:
-l MIN_LENGTH, --min-len MIN_LENGTH
Minimum length of contigs to keep (contigs shorter
than this value will not be included in the output
file). The default is 0, so nothing will be removed if
you do not declare a minimum size.
--max-percentage-gaps PERCENTAGE
Maximum fraction of gaps in a sequence (any sequence
with more gaps will be removed from the output FASTA
file). The default is 100.000000.
-i TXT FILE, --exclude-ids TXT FILE
IDs to remove from the FASTA file. You cannot provide
both --keep-ids and --exclude-ids.
-I TXT FILE, --keep-ids TXT FILE
If provided, all IDs not in this file will be excluded
from the reformatted FASTA file. Any additional
filters (such as --min-len) will still be applied to
the IDs in this file. You cannot provide both
--exclude-ids and --keep-ids.
-o FASTA FILE, --output-file FASTA FILE
Output file path.
--simplify-names Edit deflines to make sure contigs have simple
names.
--prefix PREFIX Use this parameter if you would like to add a prefix
to your contig names while simplifying them. The
prefix must be a single word (you can use the underscore
character, but nothing more!).
-r REPORT FILE, --report-file REPORT FILE
Report file path. When you run this program with
`--simplify-names` flag, all changes to deflines will
be reported in this file in case you need to go back
to this information later. It is not mandatory to
declare one, but it is a very good idea to have it.
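How `--min-len`, `--simplify-names`, `--prefix`, and the report file fit together can be sketched in Python. This is not the script's code. The zero-padded naming scheme, the default `c` prefix, and the `(new name, original defline)` report layout are assumptions made for illustration.

```python
import re


def reformat_records(records, min_len=0, simplify_names=False, prefix=None):
    """Filter (defline, sequence) pairs by length and optionally simplify names.

    Returns (kept_records, report), where report lists (new_name, old_defline).
    """
    if prefix is not None and not re.fullmatch(r"[A-Za-z0-9_]+", prefix):
        raise ValueError("prefix must be a single word (underscores allowed)")
    kept, report = [], []
    for defline, sequence in records:
        if len(sequence) < min_len:
            continue  # contigs shorter than min_len are left out
        if simplify_names:
            # Assumed naming scheme: <prefix>_<zero-padded counter>
            new_name = f"{prefix or 'c'}_{len(kept) + 1:012d}"
            report.append((new_name, defline))
        else:
            new_name = defline
        kept.append((new_name, sequence))
    return kept, report


records = [("contig 1 len=8 cov=3.2", "ACGTACGT"), ("tiny", "AC")]
kept, report = reformat_records(records, min_len=4, simplify_names=True, prefix="MAG")
```

The report is what lets you map the simplified names back to the original deflines later, which is why declaring a report file is a good idea.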
Run eggnog-mapper on a contigs database, and store results
Usage
anvi-script-run-eggnog-mapper [-h] -c CONTIGS_DB
[--cog-data-dir COG_DATA_DIR]
[-T NUM_THREADS]
[--drop-previous-annotations]
[--annotation EMAPPER_ANNOTATION_FILE]
[--use-version EMAPPER_VERSION]
Parameters
optional arguments:
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs-database'
--cog-data-dir COG_DATA_DIR
The directory path for your COG setup if you did not
use the default directory.
-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Also, please be careful with
this option if you are running your commands on an SGE
cluster: if you submit your runs to a cluster and ask for
multiple threads, you may deplete your
resources very fast.
--drop-previous-annotations
When declared, previous annotations in the database
will be dropped.
--annotation EMAPPER_ANNOTATION_FILE
If you have an annotation file from a previous run,
you can call this program to import the contents of
that file into the database instead of running from
scratch. In that case, you must also use the `--use-
version` parameter to clarify which parser version
should be used to parse it.
--use-version EMAPPER_VERSION
The version of eggnog-mapper that generated the
annotation file.
Take the output of anvi-gen-variability-profile and prepare an output for the interactive interface
Usage
anvi-script-snvs-to-interactive [-h]
[--min-departure-from-consensus FLOAT]
[--max-departure-from-consensus FLOAT]
[--min-departure-from-reference FLOAT]
[--max-departure-from-reference FLOAT]
[--display-dep-from-reference]
[--only-in-genes] [--random INTEGER]
[--just-do-it] -o DIR_PATH
VARIABILITY_PROFILE
Parameters
positional arguments:
VARIABILITY_PROFILE The output file generated by anvi-gen-variability-
profile
optional arguments:
--min-departure-from-consensus FLOAT
Minimum departure from consensus at a given variable
nucleotide position. The default is 0.00.
--max-departure-from-consensus FLOAT
Maximum departure from consensus at a given variable
nucleotide position. The default is 1.00.
--min-departure-from-reference FLOAT
Minimum departure from reference at a given variable
nucleotide position. The default is 0.00.
--max-departure-from-reference FLOAT
Maximum departure from reference at a given variable
nucleotide position. The default is 1.00.
--display-dep-from-reference
By default this program will generate a matrix file
that displays departure from consensus values. This
flag will switch to departure from reference.
--only-in-genes With this flag you will ignore SNVs in non-coding
regions.
--random INTEGER Use this parameter to randomly subset your data. If
there are too many SNV positions, this script may take
forever to finish. You should *never* let it try to
deal with more than 25-30K points, but an ideal would
be around 4-5 thousand.
--just-do-it Don't bother me with questions or warnings, just do
it.
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files
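The range filters and the random subsetting described above can be sketched in Python. This is only an illustration under stated assumptions, not the script's actual code: `select_positions` is a hypothetical helper, the bounds are assumed to be inclusive, and the rows are toy dictionaries rather than a real variability profile.

```python
import random


def select_positions(rows, column, min_value=0.0, max_value=1.0,
                     random_n=None, seed=None):
    """Keep rows whose `column` value lies within [min_value, max_value].

    If `random_n` is given and more rows than that survive the filter,
    return a random subset of `random_n` rows (assumed behaviour).
    """
    kept = [row for row in rows if min_value <= row[column] <= max_value]
    if random_n is not None and len(kept) > random_n:
        kept = random.Random(seed).sample(kept, random_n)
    return kept


# Toy data; a real profile has many more columns.
rows = [
    {"pos": 10, "departure_from_consensus": 0.05},
    {"pos": 42, "departure_from_consensus": 0.30},
    {"pos": 77, "departure_from_consensus": 0.48},
    {"pos": 91, "departure_from_consensus": 0.60},
]

in_range = select_positions(rows, "departure_from_consensus",
                            min_value=0.1, max_value=0.5)
print([row["pos"] for row in in_range])

subset = select_positions(rows, "departure_from_consensus", random_n=2, seed=0)
print(len(subset))
```

Subsetting after filtering means the random sample is drawn only from positions that already satisfy the departure thresholds.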
Tabulates TAB-delimited data with headers in the terminal: cat table.txt | anvi-script-tabulate
Usage
anvi-script-tabulate [-h]
Parameters
Transpose a TAB-delimited file
Usage
anvi-script-transpose-matrix [-h] -o MATRIX_FILE MATRIX_FILE
Parameters
positional arguments:
MATRIX_FILE Input matrix.
optional arguments:
-o MATRIX_FILE, --output-file MATRIX_FILE
File path to store results.
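What transposing a TAB-delimited matrix means can be shown with a short Python sketch. It is an illustration only, not the script's implementation: `transpose_tab_delimited` is a hypothetical helper, and it assumes every row has the same number of columns.

```python
def transpose_tab_delimited(text):
    """Swap rows and columns of TAB-delimited text and return the result."""
    rows = [line.split("\t") for line in text.splitlines() if line]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all rows must have the same number of columns")
    # zip(*rows) yields the columns of the input, which become output rows.
    return "\n".join("\t".join(column) for column in zip(*rows))


matrix = "sample\tgene_1\tgene_2\nS01\t3\t5\nS02\t0\t7"
print(transpose_tab_delimited(matrix))
```

Transposing twice returns the original matrix, which is a quick sanity check for any implementation.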
A script to convert SNV output obtained from anvi-gen-variability-profile to the standard VCF format
Usage
anvi-script-variability-to-vcf [-h] [-i FILE_PATH] [-o FILE_PATH]
Parameters
optional arguments:
-i FILE_PATH, --input FILE_PATH
Filepath to the SNV table. This is the output from the
anvi-gen-variability-profile program with the
nucleotide engine (which is the default engine).
-o FILE_PATH, --output-file FILE_PATH
File path to store results.