Microbial 'omics

| Key | Value |
| --- | --- |
| Anvi'o version | margaret (v5.2) |
| Profile DB version | 30 |
| Contigs DB version | 12 |
| Structure DB version | 1 |
| Pan DB version | 12 |
| Genome data storage version | 6 |
| Auxiliary data storage version | 2 |

# Programs

## anvi-cluster-with-concoct

A program to cluster items in a merged anvi'o profile using CONCOCT, and optionally create a collection in the profile database. This is especially useful if you want more control over the number of clusters to work with and plan to refine them manually later.

Usage

anvi-cluster-with-concoct [-h] -p PROFILE_DB -c CONTIGS_DB
[-o FILE_PATH] [--skip-store-in-db]
[-C COLLECTION_NAME]
[--num-clusters-requested INT]


Parameters

optional arguments:

  -p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
--skip-store-in-db    By default, analysis results are stored in the profile
database. The use of this flag will let you skip that.
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
--num-clusters-requested INT
How many clusters do you request? Default is 400.


## anvi-compute-ani

Export sequences from external genomes and compute ANI. If a pan database is given, anvi'o will write the computed output to the misc data tables of the pan database.

Usage

anvi-compute-ani [-h] [-i FILE_PATH] [-e FILE_PATH] -o DIR_PATH
[-p PAN_DB] [-T NUM_THREADS] [--log-file FILE_PATH]
[--method {ANIm,ANIb,ANIblastall,TETRA}]
[--distance DISTANCE_METRIC]


Parameters

optional arguments:

  -i FILE_PATH, --internal-genomes FILE_PATH
A five-column TAB-delimited flat text file. The header
line must contain these columns: 'name', 'bin_id',
'collection_id', 'profile_db_path', 'contigs_db_path'.
Each line should list a single entry, where 'name' can
be any name to describe the anvi'o bin identified as
'bin_id' that is stored in a collection.
-e FILE_PATH, --external-genomes FILE_PATH
A two-column TAB-delimited flat text file that lists
anvi'o contigs databases. The first item in the header
line should read 'name', and the second should read
'contigs_db_path'. Each line in the file should
describe a single entry, where the first column is the
name of the genome (or MAG), and the second column is
the anvi'o contigs database generated for this genome.
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files
-p PAN_DB, --pan-db PAN_DB
Anvi'o pan database
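-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on a SGE:
if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.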
--log-file FILE_PATH  File path to store debug/output messages.
--method {ANIm,ANIb,ANIblastall,TETRA}
Method for pyANI. The default is ANIb. You must have
the necessary binary in path for whichever method you
choose. According to the pyANI help for v0.2.7 at
https://github.com/widdowquinn/pyani, the method
'ANIm' uses MUMmer (NUCmer) to align the input
sequences. 'ANIb' uses BLASTN+ to align 1020nt
fragments of the input sequences. 'ANIblastall' uses
the legacy BLASTN to align 1020nt fragments. Finally,
'TETRA' calculates tetranucleotide frequencies of
each input sequence.
--distance DISTANCE_METRIC
The distance metric for the hierarchical clustering.
The default is "euclidean".
--linkage LINKAGE_METHOD
The linkage method for the hierarchical clustering.
The default is "ward".
--just-do-it          Don't bother me with questions or warnings, just do
it.
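
The external genomes file described above is easy to generate from a script. The following is a minimal sketch, assuming the header items are 'name' and 'contigs_db_path'; the genome names and paths are hypothetical placeholders, not files that ship with anvi'o.

```python
import csv
import io


def write_external_genomes(rows, handle):
    """Write (name, contigs_db_path) pairs as a two-column TAB-delimited file."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Header line assumed from the --external-genomes description.
    writer.writerow(["name", "contigs_db_path"])
    for name, path in rows:
        writer.writerow([name, path])


# Hypothetical genome names and database paths, for illustration only.
genomes = [
    ("Genome_A", "genome_a/contigs.db"),
    ("Genome_B", "genome_b/contigs.db"),
]

buffer = io.StringIO()
write_external_genomes(genomes, buffer)
print(buffer.getvalue(), end="")
```

In practice you would pass an open file handle instead of the in-memory buffer and hand the resulting path to `-e`.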


## anvi-compute-completeness

A script to generate completeness info for a given list of splits

Usage

anvi-compute-completeness [-h] [--splits-of-interest FILE] -c
CONTIGS_DB [-e E-VALUE]
[--list-completeness-sources]
[--completeness-source NAME]


Parameters

optional arguments:

  --splits-of-interest FILE
A file with split names. There should be only one
column in the file, and each line should correspond to
a unique split name.
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-e E-VALUE, --min-e-value E-VALUE
Minimum significance score of an HMM find to be
considered as a valid hit. Default is 1e-15.
--list-completeness-sources
Show available sources and exit.
--completeness-source NAME
Single-copy gene source to use to estimate
completeness.
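
Since `--splits-of-interest` expects a single column of unique split names, it can help to check a listing before passing it on. This is only a sketch of such a check, not part of anvi'o; the split names used in the example are hypothetical.

```python
def read_split_names(lines):
    """Return split names from a one-column listing, rejecting duplicates."""
    names = [line.strip() for line in lines if line.strip()]
    duplicates = sorted({name for name in names if names.count(name) > 1})
    if duplicates:
        raise ValueError("Duplicate split names: " + ", ".join(duplicates))
    return names


# Hypothetical split names, one per line, as they would appear in the file.
example = ["contig_1_split_00001\n", "contig_1_split_00002\n", "\n"]
splits = read_split_names(example)
print(splits)
```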


## anvi-compute-gene-cluster-homogeneity

Compute homogeneity for gene clusters

Usage

anvi-compute-gene-cluster-homogeneity [-h] -p PAN_DB
[-g GENOMES_STORAGE]
[-o FILE_PATH] [--store-in-db]
[--gene-cluster-id GENE_CLUSTER_ID]
[--gene-cluster-ids-file FILE_PATH]
[-C COLLECTION_NAME]
[-b BIN_NAME]
[--quick-homogeneity]


Parameters

INPUT FILES: Input files from the pangenome analysis.

  -p PAN_DB, --pan-db PAN_DB
Anvi'o pan database
-g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
Anvi'o genomes storage file


REPORTING: How do you want results to be reported? Anvi'o can produce a TAB-delimited output file for you (for which you would have to provide an output file name). Or the results can be stored in the pan database directly, for which you would have to explicitly ask for it. You can get both as well in case you are a fan of redundancy and poor data analysis practices. Anvi'o does not judge.

  -o FILE_PATH, --output-file FILE_PATH
File path to store results.
--store-in-db         Store analysis results into the database directly.


SELECTION: Which gene clusters should be analyzed. You can ask for a single gene cluster, or multiple ones listed in a file, or you can use a collection and bin name to list gene clusters of interest.

  --gene-cluster-id GENE_CLUSTER_ID
Gene cluster ID you are interested in.
--gene-cluster-ids-file FILE_PATH
Text file for gene clusters (each line should contain
a unique gene cluster ID).
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.


OPTIONAL: Optional stuff available for you to use

  --quick-homogeneity   By default, anvi'o will use a homogeneity algorithm
that checks for horizontal and vertical geometric
homogeneity (along with functional). With this flag,
you can tell anvi'o to skip horizontal geometric
homogeneity calculations. It will be less accurate but
quicker.
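-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on a SGE:
if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.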
--just-do-it          Don't bother me with questions or warnings, just do
it.


## anvi-db-info

Access self tables, display values, or set new ones entirely at your own risk.

Usage

anvi-db-info [-h] [--self-key SELF_KEY] [--self-value SELF_VALUE]
[--just-do-it]
DATABASE_PATH


Parameters

Input: The database path you wish to access.

  DATABASE_PATH         An anvi'o database for pan, profile, contigs, or
auxiliary data


Very dangerous zone: For power users with extreme self-control and maturity.

  --self-key SELF_KEY   The key you wish to set or change
--self-value SELF_VALUE
The value you wish to set for the self key
--just-do-it          Don't bother me with questions or warnings, just do
it.
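
To make the idea of a "self table" concrete, here is a rough sketch that reads key/value pairs with Python's `sqlite3` module. It assumes that an anvi'o database is a SQLite file containing a table named `self` with `key` and `value` columns; that layout, and the example keys, are assumptions made for illustration, and the database here is built in memory rather than read from disk.

```python
import sqlite3


def read_self_table(connection):
    """Return the key/value pairs of the 'self' table as a dict."""
    return dict(connection.execute("SELECT key, value FROM self"))


# Stand-in database built in memory; a real run would open a file path instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE self (key TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO self VALUES (?, ?)",
    [("db_type", "contigs"), ("version", "12")],  # hypothetical entries
)

info = read_self_table(conn)
print(info["db_type"], info["version"])
```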


## anvi-delete-collection

Remove a collection from a given profile database.

Usage

anvi-delete-collection [-h] -p PROFILE_DB [-C COLLECTION_NAME]
[--list-collections]


Parameters

optional arguments:

  -p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
--list-collections    Show available collections and exit.


## anvi-delete-hmms

Remove HMM hits from an anvi'o contigs database.

Usage

anvi-delete-hmms [-h] -c CONTIGS_DB [--hmm-source SOURCE NAME] [-l]
[--just-do-it]


Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
--hmm-source SOURCE NAME
Use a specific HMM source. You can use '--list-hmm-
sources' flag to see a list of available resources.
The default is 'None'.
-l, --list-hmm-sources
List available HMM sources in the contigs database and
quit.
--just-do-it          Don't bother me with questions or warnings, just do
it.


## anvi-delete-misc-data

Remove stuff from additional data or order tables in pan or profile databases for items or layers

Usage

anvi-delete-misc-data [-h] -p PAN_OR_PROFILE_DB -t NAME
[--keys-to-remove KEYS_TO_REMOVE]
[--list-available-keys] [--just-do-it]


Parameters

optional arguments:

  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database
-t NAME, --target-data-table NAME
The target table is the table you are interested in
accessing. Currently it can be 'items','layers', or
'layer_orders'. Please see the most up-to-date online
documentation for more information.
--keys-to-remove KEYS_TO_REMOVE
A comma-separated list of data keys to remove from the
database. If you do not use this parameter, anvi'o
will simply remove everything from the target data
table immediately.
--list-available-keys
Using this flag will list available data keys in the
target data table and quit without doing anything
else.
--just-do-it          Don't bother me with questions or warnings, just do
it.


## anvi-delete-state

Delete an anvi'o state from a pan or profile database.

Usage

anvi-delete-state [-h] -p PAN_OR_PROFILE_DB [-s STATE_NAME]
[--list-states]


Parameters

optional arguments:

  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database
-s STATE_NAME, --state STATE_NAME
The state name to ... delete :(
--list-states         Show available states and exit.


## anvi-display-contigs-stats

Start the anvi'o interactive interface for viewing or comparing contigs statistics

Usage

anvi-display-contigs-stats [-h] [--report-as-text] [-o FILE_PATH]
[-I IP_ADDR] [-P INT] [--browser-path PATH]
[--server-only]
CONTIG DATABASE(S) [CONTIG DATABASE(S) ...]


Parameters

positional arguments:

  CONTIG DATABASE(S)    Anvi'o contigs databases to display statistics for. You
can give multiple databases by separating them with
spaces.


REPORT CONFIGURATION: Specify what kind of output you want.

  --report-as-text      If you give this flag, anvi'o will not open a new
browser to show contigs database statistics. Instead, it
will write all stats to a TAB-separated file. You must
also give --output-file with this flag, otherwise anvi'o
will complain.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.


SERVER CONFIGURATION: For power users.

  -I IP_ADDR, --ip-address IP_ADDR
IP address for the HTTP server. The default IP address
(0.0.0.0) should work just fine for most.
-P INT, --port-number INT
Port number to use for anvi'o services. If nothing is
declared, anvi'o will try to find a suitable port
number, starting from the default port number, 8080.
--browser-path PATH   By default, anvi'o will use your default browser to
launch the interactive interface. If you would like to
use something else than your system default, you can
provide a full path for an alternative browser using
this parameter, and hope for the best. For instance we
are using this parameter to call Google's experimental
browser, Canary, which performs better with demanding
visualizations.
--server-only         The default behavior is to start the local server, and
fire up a browser that connects to the server. If you
have other plans, and want to start the server without
calling the browser, this is the flag you need.
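
The port behavior described for `--port-number` (start at 8080 and look for a suitable port) can be sketched as follows. This is an illustration of the general idea using only the standard library, not anvi'o's actual implementation; the function name, the number of attempts, and probing on the loopback address are all assumptions.

```python
import socket


def find_available_port(start=8080, attempts=100, host="127.0.0.1"):
    """Return the first port at or above `start` that can be bound on `host`."""
    for port in range(start, start + attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port is taken (or not permitted), try the next one
            return port
    raise RuntimeError(f"No free port in {start}..{start + attempts - 1}")


port = find_available_port()
print(f"A server could listen on port {port}")
```

Note that a port found this way can still be claimed by another process before the server binds it, so a real server should handle a bind failure as well.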


## anvi-display-pan

Start an anvi'o server to display a pan-genome

Usage

anvi-display-pan [-h] -p PAN_DB [-g GENOMES_STORAGE] [-d VIEW_DATA]
[-A ADDITIONAL_LAYERS] [--view NAME] [--title NAME]
[--export-svg FILE_PATH] [--skip-init-functions]
[--server-only]


Parameters

INPUT FILES: Input files from the pangenome analysis.

  -p PAN_DB, --pan-db PAN_DB
Anvi'o pan database
-g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
Anvi'o genomes storage file


OPTIONAL INPUTS: Where the yay factor becomes a reality.

  -d VIEW_DATA, --view-data VIEW_DATA
A TAB-delimited file for view data
-t NEWICK, --tree NEWICK
NEWICK formatted tree structure


  -V ADDITIONAL_VIEW, --additional-view ADDITIONAL_VIEW
A TAB-delimited file for an additional view to be used
in the interface. This file should contain all
split names, and values for each of them in all
samples. Each column in this file must correspond to a
sample name. Content of this file will be called
'user_view', which will be available as a new item in
the 'views' combo box in the interface.
-A ADDITIONAL_LAYERS, --additional-layers ADDITIONAL_LAYERS
A TAB-delimited file for additional layers for splits.
The first column of this file must be split names, and
the remaining columns should be unique attributes. The
file does not need to contain all split names, or
values for each split in every column. Anvi'o will try
to deal with missing data nicely. Each column in this
file will be visualized as a new layer in the tree.


  --view NAME           Start the interface with a pre-selected view. To see a
list of available views, use --show-views flag.
--title NAME          Title for the interface. If you are working with a
RUNINFO dict, the title will be determined based on
information stored in that file. Regardless, you can
override that value using this parameter. If you are
not using an anvi'o RUNINFO dictionary, a meaningful
title will appear in the interface only if you define
one using this parameter.
--state-autoload NAME
Automatically load previously saved state and draw tree.
To see a list of available states, use --show-states
flag.
--collection-autoload NAME
Automatically load a collection and draw tree. To see
a list of available collections, use --list-
collections flag.
--export-svg FILE_PATH
The SVG output file path.


SWEET PARAMS OF CONVENIENCE: Parameters and flags that are not quite essential (but nice to have).

  --skip-init-functions
When declared, function calls for genes will not be
initialized (therefore will be missing from all
relevant interfaces or output files). The use of this
flag may reduce the memory footprint and processing
time for large datasets.
--dry-run             Don't do anything real. Test everything, and stop
right before wherever the developer said 'well, this
is enough testing', and decided to print out results.
--skip-auto-ordering  When declared, the attempt to include automatically
generated orders of items based on additional data is
skipped. In case those buggers cause issues with your
data, and you still want to see your stuff and deal
with the other issue maybe later.


SERVER CONFIGURATION: For power users.

  -I IP_ADDR, --ip-address IP_ADDR
IP address for the HTTP server. The default IP address
(0.0.0.0) should work just fine for most.
-P INT, --port-number INT
Port number to use for anvi'o services. If nothing is
declared, anvi'o will try to find a suitable port
number, starting from the default port number, 8080.
--browser-path PATH   By default, anvi'o will use your default browser to
launch the interactive interface. If you would like to
use something else than your system default, you can
provide a full path for an alternative browser using
this parameter, and hope for the best. For instance we
are using this parameter to call Google's experimental
browser, Canary, which performs better with demanding
visualizations.
--read-only           When the interactive interface is started with this
flag, all 'database write' operations will be
disabled.
--server-only         The default behavior is to start the local server, and
fire up a browser that connects to the server. If you
have other plans, and want to start the server without
calling the browser, this is the flag you need.


## anvi-display-structure

optional arguments: -h, --help show this help message and exit

Usage

anvi-display-structure [-h] -s STRUCTURE_DB [-p PROFILE_DB]
[-c CONTIGS_DB] [-V VARIABILITY_TABLE]
[--splits-of-interest FILE] [-C COLLECTION_NAME]
[-b BIN_NAME] [--samples-of-interest FILE]
[--genes-of-interest FILE]
[--gene-caller-ids GENE_CALLER_IDS] [-j FLOAT]
[-P INT] [--browser-path PATH] [--server-only]


Parameters

STRUCTURE: Information related to the structure database, which can be created with anvi-gen-structure-database.

  -s STRUCTURE_DB, --structure-db STRUCTURE_DB
Anvi'o structure database.


VARIABILITY: We can overlay codon and amino acid variability in your metagenomes but we need a data source of this variability. Most simply, anvi'o can learn this information when you provide both your profile (-p) and contigs (-c) databases. Alternatively, you can provide a variability table output (-V) from the program anvi-gen-variability-profile. Finally, you can visualize the structures without overlaying variation by providing the flag --no-variability.

  -p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-V VARIABILITY_TABLE, --variability-profile VARIABILITY_TABLE
FIXME


REFINING PARAMETERS: Which samples, genes, and contigs etc. are you interested in? Define that stuff here.

  --splits-of-interest FILE
A file with split names. There should be only one
column in the file, and each line should correspond to
a unique split name.
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-b BIN_NAME, --bin-id BIN_NAME
If provided, any genes found in both your bin and your
structure database will be available for display.
--samples-of-interest FILE
A file with samples names. There should be only one
column in the file, and each line should correspond to
a unique sample name (without a column header).
--genes-of-interest FILE
A file with anvi'o gene caller IDs. There should be
only one column in the file, and each line should
correspond to a unique gene caller id (without a
column header).
--gene-caller-ids GENE_CALLER_IDS
Gene caller ids. Multiple of them can be declared
separated by a delimiter (the default is a comma). In
anvi-gen-variability-profile, if you declare nothing
you will get all genes matching your other filtering
criteria. In other programs, you may get everything,
nothing, or an error. It really depends on the
situation. Fortunately mistakes are cheap, so it's
worth a try.
-j FLOAT, --min-departure-from-consensus FLOAT
Takes a value between 0 and 1, where 1 is maximum
divergence from the consensus. It can be an expensive
operation to display every variable position, and so
the default is 0.05. To display every variable
position, set this parameter to 0.
--SAAVs-only          If provided, variability will be generated for single
amino acid variants (SAAVs) and not for single codon
variants (SCVs). This could save you some time if
you're only interested in SAAVs.
--SCVs-only           If provided, variability will be generated for single
codon variants (SCVs) and not for single amino acid
variants (SAAVs). This could save you some time if
you're only interested in SCVs.
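
To illustrate how a threshold such as `--min-departure-from-consensus` thins out variable positions, here is a small sketch. It assumes that departure from consensus at a position is the fraction of observations that differ from the most frequent item, and that positions at or above the threshold are kept; both points, along with the function names and the example counts, are assumptions for illustration rather than anvi'o's own code.

```python
def departure_from_consensus(counts):
    """Fraction of observations that differ from the most frequent item."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - max(counts.values()) / total


def filter_positions(positions, min_departure=0.05):
    """Keep positions whose departure from consensus reaches the threshold."""
    return {
        position: counts
        for position, counts in positions.items()
        if departure_from_consensus(counts) >= min_departure
    }


# Hypothetical per-position item counts.
positions = {
    10: {"A": 98, "G": 2},   # departure 0.02, dropped at the default threshold
    42: {"A": 60, "T": 40},  # departure 0.40, kept
}

print(sorted(filter_positions(positions)))
print(sorted(filter_positions(positions, min_departure=0)))
```

Setting the threshold to 0 keeps every position, which matches the note above about displaying every variable position.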


SERVER CONFIGURATION: For power users.

  -I IP_ADDR, --ip-address IP_ADDR
IP address for the HTTP server. The default IP address
(0.0.0.0) should work just fine for most.
-P INT, --port-number INT
Port number to use for anvi'o services. If nothing is
declared, anvi'o will try to find a suitable port
number, starting from the default port number, 8080.
--browser-path PATH   By default, anvi'o will use your default browser to
launch the interactive interface. If you would like to
use something else than your system default, you can
provide a full path for an alternative browser using
this parameter, and hope for the best. For instance we
are using this parameter to call Google's experimental
browser, Canary, which performs better with demanding
visualizations.
--server-only         The default behavior is to start the local server, and
fire up a browser that connects to the server. If you
have other plans, and want to start the server without
calling the browser, this is the flag you need.


## anvi-experimental-organization

why yes we do stuff here.

Usage

anvi-experimental-organization [-h] [-p PROFILE_DB] -c CONTIGS_DB
[-i DIR_PATH] [-N NAME]
[--distance DISTANCE_METRIC]
[--skip-store-in-db] [-o FILE_PATH]
[--dry-run]
PATH


Parameters

positional arguments:

  PATH                  Config file for clustering of contigs. See
documentation for help.


optional arguments:

  -p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-i DIR_PATH, --input-directory DIR_PATH
Input directory where the input files addressed from
the configuration file can be found (i.e., the profile
database, if PROFILE.db::TABLE notation is used in the
configuration file).
-N NAME, --name NAME  The name to use when storing the resulting clustering
in the database. This name will appear in the
interactive interface and other relevant interfaces.
Please consider using a short and descriptive single-
word (if you do not do that you will make anvi'o
complain).
--distance DISTANCE_METRIC
The distance metric for the hierarchical clustering.
If you do not use this flag, the distance metric you
defined in your clustering config file will be used.
If you have not defined one in your config file, then
the system default will be used, which is "euclidean".
--linkage LINKAGE_METHOD
Same story as with --distance, except the system
default for this one is "ward".
--skip-store-in-db    By default, analysis results are stored in the profile
database. The use of this flag will let you skip that.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
--dry-run             Don't do anything real. Test everything, and stop
right before wherever the developer said 'well, this
is enough testing', and decided to print out results.


## anvi-export-collection

Export a collection from an anvi'o database

Usage

anvi-export-collection [-h] -p PAN_OR_PROFILE_DB [-C COLLECTION_NAME]
[-O FILENAME_PREFIX] [--list-collections]
[--include-unbinned]


Parameters

optional arguments:

  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
A prefix to be used while naming the output files (no
file type extensions please; just a prefix).
--list-collections    Show available collections and exit.
--include-unbinned    When this flag is used, anvi'o will also store in the
output file the items that do not appear in any of
your bins. This new bin will be called
'UNBINNED_ITEMS_BIN'. Yes. The ugly name is
intentional.
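
The effect of `--include-unbinned` can be pictured as a set difference between all items and the items already assigned to bins. The sketch below shows that logic on plain Python data; the function name and the example bins are hypothetical, and only the bin name 'UNBINNED_ITEMS_BIN' comes from the help text above.

```python
def add_unbinned_bin(collection, all_items, bin_name="UNBINNED_ITEMS_BIN"):
    """Return a copy of `collection` with leftover items gathered in one bin."""
    binned = set().union(*collection.values())
    unbinned = sorted(set(all_items) - binned)
    result = {name: sorted(items) for name, items in collection.items()}
    if unbinned:
        result[bin_name] = unbinned
    return result


# Hypothetical bins and item names.
collection = {"Bin_1": ["split_1", "split_2"], "Bin_2": ["split_3"]}
all_items = ["split_1", "split_2", "split_3", "split_4", "split_5"]

exported = add_unbinned_bin(collection, all_items)
print(exported["UNBINNED_ITEMS_BIN"])
```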


## anvi-export-contigs

Export contigs (or splits) from an anvi'o contigs database

Usage

anvi-export-contigs [-h] -c CONTIGS_DB [--splits-mode] -o FILE_PATH


Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-o FILE_PATH, --output-file FILE_PATH
File path to store results.


## anvi-export-functions

Export gene function calls from an anvi'o contigs database

Usage

anvi-export-functions [-h] -c CONTIGS_DB [-o FILE_PATH]
[--annotation-sources SOURCE NAME[S]] [-l]


Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
--annotation-sources SOURCE NAME[S]
Get functional annotations for a specific list of
annotation sources. You can specify one or more
sources by separating them from each other with a
comma character (e.g., '--annotation-sources
source_1,source_2,source_3'). The default behavior is
to return everything.
-l, --list-annotation-sources
List available functional annotation sources.


## anvi-export-gene-calls

Export gene calls from an anvi'o contigs database.

Usage

anvi-export-gene-calls [-h] -c CONTIGS_DB [-o FILE_PATH]


Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-o FILE_PATH, --output-file FILE_PATH
File path to store results.


## anvi-export-gene-coverage-and-detection

Export gene coverage and detection data from

Usage

anvi-export-gene-coverage-and-detection [-h] -p PROFILE_DB -c
CONTIGS_DB -O FILENAME_PREFIX


Parameters

optional arguments:

  -p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
A prefix to be used while naming the output files (no
file type extensions please; just a prefix).


## anvi-export-locus

Search for a function or HMM hit and, for each matching gene, get the sequence including a block of genes around the match. The output will be written to a FASTA file (or multiple files, see the --separate-fasta option below). The headers of the sequences in the FASTA file hold some information about the gene.

Usage

anvi-export-locus [-h] -c CONTIGS_DB -n NUM_GENES [-s SEARCH_TERM]
[--gene-caller-ids GENE_CALLER_IDS]
[--delimiter CHAR] -O FILENAME_PREFIX
[--separate-fasta] [--use-hmm]
[--hmm-sources SOURCE NAME] [-l] [-W]
[--remove-partial-hits] [--never-reverse-complement]


Parameters

Essential INPUT:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-n NUM_GENES, --num-genes NUM_GENES
For each match (to the function, or HMM that was
searched) a sequence which includes a block of genes
will be saved. The block could include either genes
only in the forward direction of the gene (defined
according to the direction of transcription of the
gene) or reverse or both. If you wish to get both
directions, use a comma (no spaces) to define the block.
For example, "-n 4,5" will give you four genes before
and five genes after. Whereas, "-n 5" will give you
five genes after (in addition to the gene that
matched). To get only genes preceding the match use
"-n 5,0". If the number of genes requested exceeds the
length of the contig, then the output will include the
sequence until the end of the contig.
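
The `-n` notation above can be summarized in a few lines of code. This is a sketch of the described semantics only, assuming genes are indexed in their order on the contig and ignoring strand; the function names are hypothetical and not part of anvi'o.

```python
def parse_num_genes(value):
    """Translate a -n value into (genes_before, genes_after)."""
    parts = [int(part) for part in value.split(",")]
    if len(parts) == 1:
        return 0, parts[0]         # "-n 5": five genes after the match
    if len(parts) == 2:
        return parts[0], parts[1]  # "-n 4,5": four before, five after
    raise ValueError("Expected 'N' or 'N,M', got: " + value)


def locus_window(match_index, genes_in_contig, value):
    """Return (first, last) gene indices of the block, clipped to the contig."""
    before, after = parse_num_genes(value)
    first = max(0, match_index - before)
    last = min(genes_in_contig - 1, match_index + after)
    return first, last


print(parse_num_genes("4,5"))
print(locus_window(2, 10, "4,5"))
```

When the request runs past either end of the contig, the window is simply clipped, mirroring the note that the output then extends only to the end of the contig.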


Additional essential INPUT - OPTION 1: Search according to either HMM or functional annotations

  -s SEARCH_TERM, --search-term SEARCH_TERM
Search term.


Additional essential INPUT - OPTION 2: Search specific gene IDs

  --gene-caller-ids GENE_CALLER_IDS
Gene caller ids. Multiple of them can be declared
separated by a delimiter (the default is a comma). In
anvi-gen-variability-profile, if you declare nothing
you will get all genes matching your other filtering
criteria. In other programs, you may get everything,
nothing, or an error. It really depends on the
situation. Fortunately mistakes are cheap, so it's
worth a try.
--delimiter CHAR      The delimiter to parse multiple input terms. The
default is ','.


THE OUTPUT: Where should the output go. It will be one FASTA file with all matches or one FASTA per match (see --separate-fasta).

  -O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
A prefix to be used while naming the output files (no
file type extensions please; just a prefix).


ADDITIONAL STUFF: Flags and parameters you can set according to your need

  --separate-fasta      Split each match to a separate FASTA file.
--use-hmm             Use HMM hits instead of functional annotations. If you
choose this option, you must also say which HMM source
to use.
--hmm-sources SOURCE NAME
Get sequences for a specific list of HMM sources. You
can list one or more sources by separating them from
each other with a comma character (i.e., '--hmm-
sources source_1,source_2,source_3'). If you would
like to see a list of available sources in the contigs
database, run this program with '--list-hmm-sources'
flag.
-l, --list-hmm-sources
List available HMM sources in the contigs database and
quit.
-W, --overwrite-output-destinations
Overwrite if the output files and/or directories
exist.
--remove-partial-hits
By default anvi'o will return hits even if they are
partial. Declaring this flag will make anvi'o filter
all hits that are partial. Partial hits are hits in
which you asked for n1 genes before and n2 genes after
the gene that matched the search criteria but the
search hits the end of the contig before finding the
number of genes that you asked.
--never-reverse-complement
By default, if a gene that is found by the search
criteria is in the reverse direction, then the
sequence of the entire locus is reversed before it is
saved to the output. If you wish to prevent this
behavior then use the flag --never-reverse-complement.
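
The orientation rule behind `--never-reverse-complement` can be sketched as below. The sketch assumes that "reversed" in the description means reverse-complemented, as the flag name suggests; the function names are hypothetical and the sequence is a toy example.

```python
# Translation table for DNA complements (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def orient_locus(sequence, gene_is_reverse, never_reverse_complement=False):
    """Flip the locus when the matching gene is reverse, unless told not to."""
    if gene_is_reverse and not never_reverse_complement:
        return reverse_complement(sequence)
    return sequence


print(orient_locus("ATGC", gene_is_reverse=True))
print(orient_locus("ATGC", gene_is_reverse=True, never_reverse_complement=True))
```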


## anvi-export-misc-data

Export additional data or order tables in pan or profile databases for items or layers.

Usage

anvi-export-misc-data [-h] -p PAN_OR_PROFILE_DB -t NAME [-D NAME]
[-o FILE_PATH]


Parameters

optional arguments:

  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database
-t NAME, --target-data-table NAME
The target table is the table you are interested in
accessing. Currently it can be 'items','layers', or
'layer_orders'. Please see the most up-to-date online
documentation for more information.
-D NAME, --target-data-group NAME
Data group to focus. Anvi'o misc data tables support
associating a set of data keys with a data group. If
you have no idea what this is, then probably you don't
need it, and anvi'o will take care of you. Note: this
flag is IRRELEVANT if you are working with additional
order data tables.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.


## anvi-export-splits-and-coverages

Export sequences and mean coverages across samples for splits or contigs

Usage

anvi-export-splits-and-coverages [-h] -p PROFILE_DB -c CONTIGS_DB
[-o DIR_PATH] [-O FILENAME_PREFIX]
[--splits-mode] [--report-contigs]
[--use-Q2Q3-coverages]


Parameters

optional arguments:

  -p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files
-O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
A prefix to be used while naming the output files (no
file type extensions please; just a prefix).
--splits-mode         Specify this flag if you would like to output
coverages of individual 'splits', rather than their
'parent' contig coverages.
--report-contigs      By default this program reports sequences and their
coverages for 'splits'. By using this flag, you can
report contig sequences and coverages instead. For
obvious reasons, you can't use this flag with
--splits-mode flag.
--use-Q2Q3-coverages  By default this program reports the mean coverage of a
split (or contig, see --report-contigs) for each
sample. By using this flag, you can report the mean
Q2Q3 coverage by excluding 25 percent of the
nucleotide positions with the smallest coverage
values, and 25 percent of the nucleotide positions
with the largest coverage values. The hope is that
this removes 'outlier' positions resulting from non-
specific mapping, etc. that skew the mean coverage
estimate.
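
The Q2Q3 idea described for `--use-Q2Q3-coverages` is a trimmed mean: sort the per-nucleotide coverage values, drop the lowest and highest quarters, and average what remains. The sketch below shows that calculation; how anvi'o rounds the quartile boundaries is not stated here, so the integer division used for trimming is an assumption, and the coverage values are made up.

```python
def mean_q2q3_coverage(coverages):
    """Mean coverage after dropping the lowest and highest 25% of positions."""
    values = sorted(coverages)
    trim = len(values) // 4  # number of positions dropped from each end
    kept = values[trim:len(values) - trim]
    return sum(kept) / len(kept) if kept else 0.0


# Hypothetical per-nucleotide coverages with two outlier positions.
coverages = [0, 10, 10, 10, 10, 10, 10, 500]

print(sum(coverages) / len(coverages))  # plain mean, skewed by the outliers
print(mean_q2q3_coverage(coverages))    # Q2Q3 mean
```

In this toy example the plain mean is 70.0 while the Q2Q3 mean is 10.0, which shows how a few non-specifically mapped positions can distort the ordinary average.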


## anvi-export-splits-taxonomy

Export taxonomy for splits found in an anvi'o contigs database

Usage

anvi-export-splits-taxonomy [-h] -c CONTIGS_DB -o FILE_PATH


Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-o FILE_PATH, --output-file FILE_PATH
File path to store results.


## anvi-export-state

Export an anvi'o state from a pan or profile database.

Usage

anvi-export-state [-h] -p PAN_OR_PROFILE_DB [-o FILE_PATH]
[-s STATE_NAME] [--list-states]


Parameters

optional arguments:

  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
-s STATE_NAME, --state STATE_NAME
The state name to export.
--list-states         Show available states and exit.


## anvi-export-table

Export anvi'o database tables as TAB-delimited text files.

Usage

anvi-export-table [-h] [--table TABLE_NAME] [-l] [-f FIELDS]
[-o FILE_PATH]
DB


Parameters

positional arguments:

  DB                    Anvi'o database to read from.


optional arguments:

  --table TABLE_NAME    Table name to export.
-l, --list            Gives a list of tables in a database and quits. If a
table is already declared, it lists all the fields in
that table instead, in case you would like to export
only a specific list of fields from the table using
the --fields parameter.
-f FIELD(S), --fields FIELD(S)
Fields to report. Use the --list parameter with a
table name to see available fields. You can list fields
using this notation: --fields 'field_1, field_2, ...
field_N'.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.


## anvi-gen-contigs-database

Generate a new anvi'o contigs database.

Usage

anvi-gen-contigs-database [-h] -f FASTA [-n PROJECT_NAME]
[-o DB_FILE_PATH] [--description TEXT_FILE]
[-L INT] [-K INT] [--skip-gene-calling]
[--external-gene-calls GENE-CALLS]
[--ignore-internal-stop-codons]
[--skip-mindful-splitting]


Parameters

MANDATORY INPUTS: Things you really need to provide to be in business.

  -f FASTA, --contigs-fasta FASTA
The FASTA file that contains reference sequences you
mapped your samples against. This could be a reference
genome, or contigs from your assembler. Contig names
in this file must match those in other input files.
If there is a problem, anvi'o will gracefully complain.
-n PROJECT_NAME, --project-name PROJECT_NAME
Name of the project. Please choose a short but
descriptive name (so anvi'o can use it whenever she
needs to name an output file, or add a new table in a
database, or name her first born).


OPTIONAL INPUTS: Things you may want to tweak.

  -o DB_FILE_PATH, --output-db-path DB_FILE_PATH
Output file path for the new database.
--description TEXT_FILE
A plain text file that contains some description about
the project. You can use Markdown syntax. The
description text will be rendered and shown in all
relevant interfaces, including the anvi'o interactive
interface, or anvi'o summary outputs.
-L INT, --split-length INT
Splitting very large contigs into multiple pieces
improves the efficacy of the visualization step. The
default value is (20000). If you are not sure, we
advise you to not go below 10,000. The lower you go,
the more complicated the tree will be, and will take
more time and computational resources to finish the
analysis. Also this is not a case of 'the smaller the
split size the more sensitive the results'. If you do
not want your contigs to be split, you can either
simply enter '0' or ANY OTHER negative integer (lots
of unnecessary freedom here, enjoy!).
-K INT, --kmer-size INT
K-mer size for k-mer frequency calculations. The
default k-mer size for composition-based analyses is
4, historically. Although tetra-nucleotide frequencies
seem to offer the sweet spot of sensitivity,
information density, and manageable number of
dimensions for clustering approaches, you are welcome
to experiment (but maybe you should leave it as is for
--skip-mindful-splitting
By default, anvi'o attempts to keep the soft-splitting
of large contigs from cutting through proper gene
calls, so that a single gene is not broken into
multiple splits. This requires a careful examination
of where genes start and end, and finding the best
locations to split contigs with respect to this
information. So, when the
user asks for a split size of, say, 1,000, it serves
as a mere suggestion. When this flag is used, anvi'o
does what the user wants and creates splits at desired
lengths (although some functionality may become
unavailable for the projects that rely on a contigs
database that is initiated this way).
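
To make the --kmer-size option above more concrete, here is a minimal sketch of counting k-mers in a sequence with the default k of 4. It is not anvi'o's code: the function name and the example sequence are hypothetical, and the sketch simply skips windows that contain ambiguous bases and does not merge reverse complements, which a real implementation might handle differently.

```python
from collections import Counter


def kmer_counts(sequence, k=4):
    """Count every k-mer made only of A, C, G and T in a sequence."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Skip windows with ambiguous characters such as N.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


# Hypothetical sequence with one ambiguous base.
counts = kmer_counts("ATGCATGCNATGC")
for kmer, count in sorted(counts.items()):
    print(kmer, count)
```

With k = 4 there are at most 256 distinct k-mers, so each sequence maps to a fixed-length vector of manageable size; every increase in k multiplies that number by four.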


GENES IN CONTIGS: Expert thingies.

  --skip-gene-calling   By default, generating an anvi'o contigs database
includes the identification of open reading frames in
contigs by running a bacterial gene caller. Declaring
this flag will by-pass that process. If you prefer,
you can later import your own gene calling results
into the database.
--external-gene-calls GENE-CALLS
A TAB-delimited file to utilize external gene calls.
The file must have these columns: 'gene_callers_id' (a
unique integer number for each gene call, start from
1), 'contig' (the contig name the gene call is found),
'start' (start position, integer), 'stop' (stop
position, integer), 'direction' (the direction of the
gene open reading frame; can be 'f' or 'r'), 'partial'
(whether it is a complete gene call, or a partial one;
must be 1 for partial calls, and 0 for complete
calls), 'source' (the gene caller), and 'version' (the
version of the gene caller, i.e., v2.6.7 or v1.0). An
example file can be found via the URL
https://goo.gl/TqCWT2
--ignore-internal-stop-codons
This is only relevant when you have an external gene
calls file. If anvi'o figures out that your custom
gene calls result in amino acid sequences with stop
codons in the middle, it will complain about it. You
can use this flag to tell anvi'o not to check for
internal stop codons, EVEN THOUGH IT MEANS THERE IS
MOST LIKELY SOMETHING WRONG WITH YOUR EXTERNAL GENE
CALLS FILE. Anvi'o will understand that sometimes we
don't want to care, and will not judge you. Instead,
it will replace every stop codon residue in the amino
acid sequence with an 'X' character. Please let us
know if you used this and things failed, so we can
tell you that you shouldn't have really used it if you
didn't like failures in the first place (smiley).
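
The column layout described for --external-gene-calls can be checked before handing a file to anvi'o. The sketch below is one hypothetical way to do that with Python's standard library; the function name, the contig names, and the exact set of checks are assumptions for illustration and are not part of anvi'o.

```python
import csv
import io

COLUMNS = ["gene_callers_id", "contig", "start", "stop",
           "direction", "partial", "source", "version"]


def check_gene_calls(handle):
    """Return a list of problems found in a TAB-delimited gene calls table."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return ["missing columns: " + ", ".join(missing)]
    problems, seen_ids = [], set()
    for line_no, row in enumerate(reader, start=2):
        for column in ("gene_callers_id", "start", "stop"):
            if not (row[column] or "").isdigit():
                problems.append(f"line {line_no}: {column} is not an integer")
        if row["gene_callers_id"] in seen_ids:
            problems.append(f"line {line_no}: gene_callers_id is not unique")
        seen_ids.add(row["gene_callers_id"])
        if row["direction"] not in ("f", "r"):
            problems.append(f"line {line_no}: direction must be 'f' or 'r'")
        if row["partial"] not in ("0", "1"):
            problems.append(f"line {line_no}: partial must be 0 or 1")
    return problems


# Hypothetical two-gene table; the second row has an invalid direction.
example = (
    "gene_callers_id\tcontig\tstart\tstop\tdirection\tpartial\tsource\tversion\n"
    "1\tcontig_01\t0\t300\tf\t0\tprodigal\tv2.6.3\n"
    "2\tcontig_01\t450\t900\tx\t0\tprodigal\tv2.6.3\n"
)
problems = check_gene_calls(io.StringIO(example))
print(problems)
```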


## anvi-gen-gene-consensus-sequences

Collapse variability for a set of genes across samples

Usage

anvi-gen-gene-consensus-sequences [-h] -p PROFILE_DB -c CONTIGS_DB
[--gene-caller-ids GENE_CALLER_IDS]
[--genes-of-interest FILE]
[--samples-of-interest FILE]
[-o FILE_PATH] [--tab-delimited]
[--engine ENGINE] [--contigs-mode]
[--compress-samples]


Parameters

optional arguments:

  --compress-samples    Normally all samples with variation will have their
own consensus sequence. If this flag is provided, the
coverages from each sample of interest will be summed
and only a single consensus sequence for each
gene/contig will be output.


DATABASES: Declaring relevant anvi'o databases. First things first.

  -p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'


FOCUS: What do we want? A consensus sequence for a gene, or a list of genes. From where do we want it? All samples, by default. When do we want it? Whenever it is convenient.

  --gene-caller-ids GENE_CALLER_IDS
Gene caller ids. Multiple of them can be declared
separated by a delimiter (the default is a comma). In
anvi-gen-variability-profile, if you declare nothing
you will get all genes matching your other filtering
criteria. In other programs, you may get everything,
nothing, or an error. It really depends on the
situation. Fortunately mistakes are cheap, so it's
worth a try.
--genes-of-interest FILE
A file with anvi'o gene caller IDs. There should be
only one column in the file, and each line should
correspond to a unique gene caller id (without a
column header).
--samples-of-interest FILE
A file with samples names. There should be only one
column in the file, and each line should correspond to
a unique sample name (without a column header).


OUTPUT: Output file and output style

  -o FILE_PATH, --output-file FILE_PATH
The output file name. The boring default is
"genes.fa". You can change the output file format to a
TAB-delimited file using the flag --tab-delimited,
in which case please do not forget to change the file
name, too.
--tab-delimited       Use the TAB-delimited format for the output file.


EXTRAS: Parameters that will help you to do a very precise analysis. If you declare nothing from this bunch, you will get "everything" to play with, which is not necessarily a good thing…

  --engine ENGINE       Variability engine. The default is 'NT'.
--contigs-mode        Use this flag to output consensus sequences of
contigs, instead of the default, which is genes


## anvi-gen-genomes-storage

Create a genome storage from internal or external genomes for a pan genome analysis.

Usage

anvi-gen-genomes-storage [-h] [-e FILE_PATH] [-i FILE_PATH]
[--gene-caller GENE-CALLER] -o FILE_PATH


Parameters

EXTERNAL GENOMES: External genomes listed as anvi'o contigs databases. As in, you have one or more genomes say from NCBI you want to work with, and you created an anvi'o contigs database for each one of them.

  -e FILE_PATH, --external-genomes FILE_PATH
A two-column TAB-delimited flat text file that lists
anvi'o contigs databases. The first item in the header
'contigs_db_path'. Each line in the file should
describe a single entry, where the first column is the
name of the genome (or MAG), and the second column is
the anvi'o contigs database generated for this genome.


INTERNAL GENOMES: Genome bins stored in anvi'o profile databases as collections.

  -i FILE_PATH, --internal-genomes FILE_PATH
A five-column TAB-delimited flat text file. The header
line must contain these columns: 'name', 'bin_id',
'collection_id', 'profile_db_path', 'contigs_db_path'.
Each line should list a single entry, where 'name' can
be any name to describe the anvi'o bin identified as
'bin_id' that is stored in a collection.
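
As a small sketch of how an internal genomes file with the header columns named above could be produced, the example below writes one with Python's csv module. The function name and all values in the example row (genome name, bin, collection, and database paths) are hypothetical placeholders.

```python
import csv
import io

HEADER = ["name", "bin_id", "collection_id",
          "profile_db_path", "contigs_db_path"]


def write_internal_genomes(rows, handle):
    """Write a TAB-delimited internal genomes table, one bin per line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    for row in rows:
        writer.writerow([row[column] for column in HEADER])


# Hypothetical entry; replace every value with your own names and paths.
rows = [{
    "name": "MAG_01",
    "bin_id": "Bin_1",
    "collection_id": "my_collection",
    "profile_db_path": "PROFILE.db",
    "contigs_db_path": "CONTIGS.db",
}]

buffer = io.StringIO()
write_internal_genomes(rows, buffer)
text = buffer.getvalue()
print(text)
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="")` as the handle.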


PRO STUFF: Things you may not have to change. But you never know (unless you read the help).

  --gene-caller GENE-CALLER
The gene caller to utilize. Anvi'o supports multiple
gene callers, and some operations (including this one)
require an explicit mention of which one to use.
The default is 'prodigal', but it will not be enough
if you were a rebel and have used --external-gene-calls
or something.


OUTPUT: Give it a nice name. Must end with '-GENOMES.db'. This is primarily due to the fact that there are other .db files used throughout anvi'o and it would be better to distinguish this very special file from them.

  -o FILE_PATH, --output-file FILE_PATH
File path to store results.


## anvi-gen-network

Generate a Gephi network for functions based on non-normalized gene coverage values

Usage

anvi-gen-network [-h] -p PROFILE_DB -c CONTIGS_DB
[--annotation-source SOURCE NAME] [-l]


Parameters

optional arguments:

  -p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
--annotation-source SOURCE NAME
Get functional annotations for a specific annotation
source. You can use the flag '--list-annotation-
sources' to learn about what sources are available.
-l, --list-annotation-sources
List available functional annotation sources.


## anvi-gen-phylogenomic-tree

Generate a phylogenomic tree from an alignment file.

Usage

anvi-gen-phylogenomic-tree [-h] -f FASTA -o FILE_PATH
[--program PROGRAM_NAME]


Parameters

INPUT FILES: Concatenated alignment files exported using anvi-get-sequences-for-gene-clusters

  -f FASTA, --fasta-file FASTA
A FASTA-formatted input file


OUTPUT FILE: The output file where the generated newick tree will be stored.

  -o FILE_PATH, --output-file FILE_PATH
File path to store results.


PROGRAM: The program that will be used for generating the tree. Available options: default, fasttree

  --program PROGRAM_NAME
Program name.


## anvi-gen-structure-database

Identifies genes in your contigs database that encode proteins that are homologous to proteins with solved structures. If sufficiently similar homologs are identified, they are used as structural templates to predict the 3D structure of proteins in your contigs database. This means we are at the mercy of structural biologists: if they have not solved a structure of a protein sufficiently similar in AA sequence to yours, this isn't going to work. But it's worth a try! The software we are using is MODELLER; you can learn more about it at https://salilab.org/modeller/, or in our tutorial, which doesn't exist yet.

Usage

anvi-gen-structure-database [-h] -c CONTIGS_DB
[--genes-of-interest FILE]
[--gene-caller-ids GENE_CALLER_IDS]
[-o DB_FILE_PATH] [--dump-dir DUMP_DIR]
[--num-models NUM_MODELS]
[--deviation DEVIATION]
[--modeller-database MODELLER_DATABASE]
[--scoring-method SCORING_METHOD]
[--very-fast]
[--percent-identical-cutoff PERCENT_IDENTICAL_CUTOFF]
[--max-number-templates MAX_NUMBER_TEMPLATES]
[--skip-DSSP]
[--modeller-executable MODELLER_EXECUTABLE]


Parameters

DATABASES: Declaring relevant anvi'o databases. First things first.

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'


GENES: Specifying which genes you want to be modelled.

  --genes-of-interest FILE
A file with anvi'o gene caller IDs. There should be
only one column in the file, and each line should
correspond to a unique gene caller id (without a
column header).
--gene-caller-ids GENE_CALLER_IDS
Gene caller ids. Multiple of them can be declared
separated by a delimiter (the default is a comma). In
anvi-gen-variability-profile, if you declare nothing
you will get all genes matching your other filtering
criteria. In other programs, you may get everything,
nothing, or an error. It really depends on the
situation. Fortunately mistakes are cheap, so it's
worth a try.


OUTPUT: Output file and output style.

  -o DB_FILE_PATH, --output-db-path DB_FILE_PATH
Output file path for the new database.
--dump-dir DUMP_DIR   Modelling and annotating structures requires a lot of
moving parts, each of which has its own outputs. The
output of this program is a structure database
containing the pertinent results of this computation,
however a lot of stuff doesn't make the cut. By
providing a directory for this parameter you will get,
in addition to the structure database, a directory
containing the raw output for everything.


MODELLER PARAMS: Parameters for MODELLER's homology modeling.

  --num-models NUM_MODELS, -N NUM_MODELS
This parameter determines the number of predicted
structures that are solved for a given protein. The
original atomic positions for each model are perturbed
by an amount defined by --deviation, which leads to
differences between each model. Therefore, whichever
of the N models is chosen to be the "best" model is
more likely to be accurate when --num-models is high,
since more of the solution space is searched. It
should be kept in mind that the largest determinant of
a model's accuracy is the set of protein templates
used, so there is no need to go overboard with an
excessively large --num-models. The default is 3.
--deviation DEVIATION, -d DEVIATION
Deviation (angstroms)
--modeller-database MODELLER_DATABASE, -D MODELLER_DATABASE
Which database do you want to search the structures
of? Default is "pdb_95". If you have your own database
it must have either the extension .bin or .pir. If you
don't have a database or don't know what this means,
don't worry, we will both inform you and take care of
you.
--scoring-method SCORING_METHOD, -b SCORING_METHOD
How should the best model be decided? The metric used
could be any of GA341_score, DOPE_score, and molpdf.
GA341 is an absolute measure, where a good model will
have a score near 1.0, whereas anything below 0.6 can
be considered bad. DOPE and molpdf scores are relative
energy measures, where lower scores are better. DOPE
has been generally shown to be a better distinguisher
between good and bad models than molpdf. By default,
https://salilab.org/modeller/tutorial/basic.html.
--very-fast           If provided, a very fast optimization is done for each
model at the cost of accuracy. It is recommended to
use a --num-models of 1, since the optimization is so
crude that all models will likely converge to the same
solution.
--percent-identical-cutoff PERCENT_IDENTICAL_CUTOFF, -p PERCENT_IDENTICAL_CUTOFF
If a protein in the database has a proper percent
identity to the gene of interest that is greater than
or equal to --percent-identical-cutoff, then it is
used as a template. Otherwise it is not. Here we
define proper percent identity as the percentage of
amino acids in the gene of interest that are identical
to an entry in the database given the sequence length
of the gene of interest. For example, if there is 100%
identity between the gene of interest and the template
over the length of the alignment, but the alignment
length is only half of the gene of interest sequence
length, then the proper percent identical is 50%.
(This helps us avoid the inflation of identity scores
due to only partially good matches). The default is
30.
--max-number-templates MAX_NUMBER_TEMPLATES, -T MAX_NUMBER_TEMPLATES
Generally speaking it is best to use as many templates
as possible given that they have high proper percent
identity to the gene of interest. Taken from
https://salilab.org/modeller/methenz/andras/node4.html: 'The
use of several templates generally increases the model
accuracy. One strength of MODELLER is that it can
combine information from multiple template structures,
in two ways. First, multiple template structures may
be aligned with different domains of the target, with
little overlap between them, in which case the
modeling procedure can construct a homology-based
model of the whole target sequence. Second, the
template structures may be aligned with the same part
of the target, in which case the modeling procedure is
likely to automatically build the model on the locally
best template [43,44]. In general, it is frequently
beneficial to include in the modeling process all the
templates that differ substantially from each other,
if they share approximately the same overall
similarity to the target sequence.' The default is 5.
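
The 'proper percent identity' defined under --percent-identical-cutoff can be written out as a short worked example. The sketch below reproduces the case given there (a perfect match over an alignment that covers half of the gene of interest); the function names and the sequence lengths are hypothetical and chosen only for illustration.

```python
def proper_percent_identity(num_identical, gene_length):
    """Identical residues as a percentage of the gene of interest's length."""
    return 100.0 * num_identical / gene_length


def is_template(num_identical, gene_length, cutoff=30.0):
    """True if a database entry meets the cutoff (the default is 30)."""
    return proper_percent_identity(num_identical, gene_length) >= cutoff


# A 200-residue gene aligned over only 100 residues, all of them identical:
# identity over the alignment is 100%, but proper percent identity is 50%.
gene_length = 200
identical_residues = 100
print(proper_percent_identity(identical_residues, gene_length))
print(is_template(identical_residues, gene_length))
```

Dividing by the length of the gene of interest rather than by the alignment length is what keeps short but perfect partial matches from looking like good templates.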


EXTRA: Everything else.

  --skip-DSSP           Dictionary of Secondary Structure of Proteins (DSSP)
is a program that takes as its input a protein
structure file and outputs predicted secondary
structure (alpha helix, beta strand, etc.), measures
of solvent accessibility, and hydrogen bonds for each
residue in the protein. If for some reason you don't
want this, provide this flag.
--modeller-executable MODELLER_EXECUTABLE
The MODELLER program to use. For example, mod9.19.
The default is mod9.19


## anvi-gen-variability-matrix

Generate Variability Matrix

Usage

anvi-gen-variability-matrix [-h] -c CONTIGS_DB --splits-of-interest
FILE [--samples-of-interest FILE]
[--num-positions-from-each-split INT]
[-m INT] [-r RATIO] [-o FILE_PATH]
SUMMARY_DICT


Parameters

positional arguments:

  SUMMARY_DICT          Summary file


optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
--splits-of-interest FILE
A file with split names. There should be only one
column in the file, and each line should correspond to
a unique split name.
--samples-of-interest FILE
A file with samples names. There should be only one
column in the file, and each line should correspond to
a unique sample name (without a column header).
--num-positions-from-each-split INT
Each split may have one or more variable positions. By
default, anvi'o will report every SNV position found
define a cutoff for the maximum number of SNVs to be
reported from a split (if the number of SNVs is more
than the number you declare using this parameter, the
positions will be randomly subsampled).
-m INT, --min-scatter INT
This one is tricky. If you have N samples in your
dataset, a given variable position x in one of your
splits can split your N samples into t groups based
on the identity of the variation they harbor at
position x. For instance, t would have been 1 if
all samples had the same type of variation at position
x (which would not be very interesting, because in
this case position x would have zero contribution to a
deeper understanding of how these samples differ based
on variability). When t > 1, it would mean that
identities at position x across samples do differ. But
how much scattering occurs based on position x when t
> 1? If t=2, how many samples ended up in each group?
Obviously, an even distribution of samples across groups
may tell us something different than an uneven
distribution of samples across groups. So, this
parameter filters out any x if 'the number of samples
in the second largest group' (=scatter) is less than
-m. Here is an example: let's assume you have 7
samples. While 5 of those have AG, 2 of them have TC
at position x. This would mean the scatter of x is 2. If
you set -m to 3, this position would not be reported
in your output matrix. The default value for -m is 0,
which means every x that is found in the database and
survived previous filtering criteria will be reported.
Naturally, -m cannot be more than half of the number
of samples. Please refer to the user documentation if
this is confusing.
-r RATIO, --min-ratio-of-competings-nts RATIO
Minimum ratio of the competing nucleotides at a given
position. Default is 0.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
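
Since --min-scatter is the trickiest parameter above, a small sketch may help. It computes scatter as the number of samples in the second largest group at a position and applies the rule as stated (a position is dropped when its scatter is less than -m). The function names are hypothetical, and this is an illustration of the description rather than anvi'o's implementation.

```python
from collections import Counter


def scatter(identities):
    """Number of samples in the second largest group at one position."""
    group_sizes = sorted(Counter(identities).values(), reverse=True)
    # With a single group (t = 1) there is no second group, so scatter is 0.
    return group_sizes[1] if len(group_sizes) > 1 else 0


def passes_min_scatter(identities, min_scatter=0):
    """Keep a position unless its scatter is less than min_scatter."""
    return scatter(identities) >= min_scatter


# Seven samples: five carry AG and two carry TC at position x.
samples = ["AG"] * 5 + ["TC"] * 2
print(scatter(samples))
print(passes_min_scatter(samples, min_scatter=0))
print(passes_min_scatter(samples, min_scatter=3))
```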


## anvi-gen-variability-network

A program to generate a network description from an anvi'o variability profile.

Usage

anvi-gen-variability-network [-h] -i VARIABILITY_PROFILE
[-n NUM_POSITIONS] [-o FILE_PATH]


Parameters

optional arguments:

  -i VARIABILITY_PROFILE, --input-file VARIABILITY_PROFILE
The anvi'o variability profile. Please see anvi-gen-
variability-profile to generate one.
-n NUM_POSITIONS, --max-num-unique-positions NUM_POSITIONS
Maximum number of unique positions to be used in the
network. This may be one way to avoid extremely large
network descriptions that would defeat the purpose of
a quick visualization. If there are more unique
positions in the variability profile, the program will
randomly select a subset of them to match the max-
num-unique-positions. The default is 0, which means
all positions should be reported. Remember that the
number of nodes in the network will also depend on the
number of samples described in the variability
profile.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
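
The random selection described for --max-num-unique-positions can be sketched as follows. This is a hypothetical illustration, not the program's code: the function name and the optional seed argument are assumptions added so the example is reproducible.

```python
import random


def subsample_positions(unique_positions, max_num_unique_positions=0, seed=None):
    """Randomly keep at most max_num_unique_positions positions (0 keeps all)."""
    positions = sorted(set(unique_positions))
    if max_num_unique_positions <= 0 or len(positions) <= max_num_unique_positions:
        return positions
    rng = random.Random(seed)
    return sorted(rng.sample(positions, max_num_unique_positions))


# Hypothetical profile with 100 unique positions, capped at 10.
kept = subsample_positions(range(100), max_num_unique_positions=10, seed=1)
everything = subsample_positions(range(5))
print(len(kept), everything)
```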


## anvi-gen-variability-profile

Extract information for variable positions

Usage

anvi-gen-variability-profile [-h] [--splits-of-interest FILE]
[-C COLLECTION_NAME] [-b BIN_NAME]
[--genes-of-interest FILE]
[--gene-caller-ids GENE_CALLER_IDS] -p
PROFILE_DB -c CONTIGS_DB [-s STRUCTURE_DB]
[-o FILE_PATH]
[--samples-of-interest FILE]
[--quince-mode] [--include-contig-names]
[--include-split-names]
[--compute-gene-coverage-stats]
[--engine ENGINE]
[--num-positions-from-each-split INT]
[--min-coverage-in-each-sample INT]
[-r FLOAT] [-z FLOAT] [-j FLOAT]
[-a FLOAT] [-x NUM_SAMPLES]
[--only-if-structure] [--skip-synonymity]


Parameters

DATABASES: Declaring relevant anvi'o databases. First things first.

  -p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-s STRUCTURE_DB, --structure-db STRUCTURE_DB
Anvi'o structure database.


SPLITS: Declaring relevant splits for the analysis. There are three ways to do it: 1) you can give a file path with split names, 2) you can provide a collection id with a bin name, or 3) you can give a list or a file path containing gene caller ids.

  --splits-of-interest FILE
A file with split names. There should be only one
column in the file, and each line should correspond to
a unique split name.
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.
--genes-of-interest FILE
A file with anvi'o gene caller IDs. There should be
only one column in the file, and each line should
correspond to a unique gene caller id (without a
column header).
--gene-caller-ids GENE_CALLER_IDS
Gene caller ids. Multiple of them can be declared
separated by a delimiter (the default is a comma). In
anvi-gen-variability-profile, if you declare nothing
you will get all genes matching your other filtering
criteria. In other programs, you may get everything,
nothing, or an error. It really depends on the
situation. Fortunately mistakes are cheap, so it's
worth a try.


OUTPUT: Output file and output style

  -o FILE_PATH, --output-file FILE_PATH
File path to store results.
--samples-of-interest FILE
A file with samples names. There should be only one
column in the file, and each line should correspond to
a unique sample name (without a column header).
--quince-mode         The default behavior is to report base frequencies of
nucleotide positions only if there is any variation
reported during profiling (which by default uses some
heuristics to minimize the impact of error-driven
variation). So, if there are 10 samples, and a given
position has been reported as a variable site during
profiling in only one of those samples, there will be
no information stored in the database for the
remaining 9. When this flag is used, we go back to
each sample, and report base frequencies for each
sample at this position even if they do not vary. It
will take considerably longer to report when this flag
is on, and the use of it will increase the file size
dramatically, however it is inevitable for some
statistical approaches (as well as for some beautiful
visualizations).
--include-contig-names
Use this flag if you would like contig names for each
variable position to be included in the output file as
a column. By default, we do not include contig names
since they can practically double the output file size
without any actual benefit in most cases.
--include-split-names
Use this flag if you would like split names for each
variable position to be included in the output file as
a column.
--compute-gene-coverage-stats
If provided, gene coverage statistics will be appended
to the table for each entry. This is very useful
information, but will not be included by default
because it is an expensive operation, and you are a busy
person.


EXTRAS: Parameters that will help you to do a very precise analysis. If you declare nothing from this bunch, you will get "everything" to play with, which is not necessarily a good thing…

  --engine ENGINE       Variability engine. The default is 'NT'.
--num-positions-from-each-split INT
Each split may have one or more variable positions. By
default, anvi'o will report every SNV position found
define a cutoff for the maximum number of SNVs to be
reported from a split (if the number of SNVs is more
than the number you declare using this parameter, the
positions will be randomly subsampled).
--min-coverage-in-each-sample INT
Minimum coverage of a given variable nucleotide
position in all samples. If a nucleotide position is
covered less than this value even in one sample, it
will be removed from the analysis. Default is 0.
-r FLOAT, --min-departure-from-reference FLOAT
Takes a value between 0 and 1, where 1 is maximum
divergence from the reference. Default is 0.000000.
The reference here is the observation that corresponds
to a given position in the mapped context.
-z FLOAT, --max-departure-from-reference FLOAT
Similar to '--min-departure-from-reference', but
defines an upper limit for divergence. The default is
1.000000.
-j FLOAT, --min-departure-from-consensus FLOAT
Takes a value between 0 and 1, where 1 is maximum
divergence from the consensus for a given position.
The default is 0.000000. The consensus is the most
frequent observation at a given position.
-a FLOAT, --max-departure-from-consensus FLOAT
Similar to '--min-departure-from-consensus', but
defines an upper limit for divergence. The default is
1.000000.
-x NUM_SAMPLES, --min-occurrence NUM_SAMPLES
Minimum number of samples in which a nucleotide
position should be reported as variable. Default is 1.
If you set it to 2, for instance, each eligible
variable position will be expected to appear in at
least two samples, which will reduce the impact of
stochastic or unintelligible variable positions.
--only-if-structure   If provided, your genes of interest will be further
subset to only include genes with structures in your
structure database, and therefore must be supplied in
conjunction with a structure database, i.e. -s
<your_structure_database>. If you did not specify
genes of interest, ALL genes will be subset to those
that have structures.
--skip-synonymity     Computing synonymity can be an expensive operation for
large data sets. Provide this flag to skip computing
synonymity. It only makes sense to provide this flag
when using --engine CDN.
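
The departure filters in this group (-r, -z, -j, -a) are easier to reason about with numbers. The sketch below assumes that departure is one minus the fraction of observations matching the reference (or the consensus) at a position, which is consistent with the 0-to-1 range described above but is an assumption of this example, as are the function name and the counts.

```python
def departures(nt_counts, reference):
    """Departure from reference and from consensus for one position."""
    total = sum(nt_counts.values())
    if total == 0:
        raise ValueError("position has no coverage")
    # The consensus is the most frequent observation at the position.
    consensus = max(nt_counts, key=nt_counts.get)
    return {
        "consensus": consensus,
        "departure_from_reference": 1 - nt_counts.get(reference, 0) / total,
        "departure_from_consensus": 1 - nt_counts[consensus] / total,
    }


# Hypothetical position covered 40 times, where the reference base is G.
counts = {"A": 30, "C": 0, "G": 10, "T": 0}
result = departures(counts, reference="G")
print(result)
```

In this made-up case most reads disagree with the reference, so the departure from reference is high while the departure from consensus stays low, which is why the two pairs of filters can select quite different positions.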


## anvi-get-aa-counts

Collects AA counts information from a contigs database for a given bin, set of contigs, or set of genes.

Usage

anvi-get-aa-counts [-h] -c CONTIGS_DB [-o FILE_PATH] [-p PROFILE_DB]
[-C COLLECTION_NAME] [-B FILE_PATH]
[--contigs-of-interest FILE]
[--gene-caller-ids GENE_CALLER_IDS]


Parameters

MANDATORY STUFF: You have to set the following two parameters, then you will select one set of parameters from the following optional sections. If you select nothing from those sets, AA counts for everything in the contigs database will be reported.

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-o FILE_PATH, --output-file FILE_PATH
File path to store results.


OPTIONAL PARAMS FOR BINS:

  -p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-B FILE_PATH, --bin-ids-file FILE_PATH
Text file for bins (each line should be a unique bin
id).


OPTIONAL PARAMS FOR CONTIGS:

  --contigs-of-interest FILE
A file with contig names. There should be only one
column in the file, and each line should correspond to
a unique contig name.


OPTIONAL PARAMS FOR GENE CALLS:

  --gene-caller-ids GENE_CALLER_IDS
Gene caller ids. Multiple of them can be declared
separated by a delimiter (the default is a comma). In
anvi-gen-variability-profile, if you declare nothing
you will get all genes matching your other filtering
criteria. In other programs, you may get everything,
nothing, or an error. It really depends on the
situation. Fortunately mistakes are cheap, so it's
worth a try.


## anvi-get-codon-frequencies

Usage

anvi-get-codon-frequencies [-h] -i INPUT_BAM -c CONTIGS_DB
--gene-caller-id GENE_CALLER_ID
FILE_PATH


Parameters

optional arguments:

  -i INPUT_BAM, --input-file INPUT_BAM
Sorted and indexed BAM file to analyze.
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
--gene-caller-id GENE_CALLER_ID
A single gene id.
By default, anvi'o will return codon frequencies (as
the name suggests), but you can ask for amino acid
frequencies instead, simply because you always need
more data and more stuff. You're lucky this time, but
is there an end to this? Will you ever be satisfied
with what you have? Anvi'o needs answers.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.


## anvi-get-enriched-functions-per-pan-group

A program that takes a pangenome and a categorical layer additional data item, and reports back the functions enriched in each category.

Usage

anvi-get-enriched-functions-per-pan-group [-h] -p PAN_DB
[-g GENOMES_STORAGE]
[--category-variable CATEGORY]
[--annotation-source SOURCE NAME]
[-l] -o FILE_PATH [-F FILE]
[-E FLOAT]
[--core-threshold FLOAT]
[-P FLOAT]
[--false-detection-rate FLOAT]


Parameters

INPUT FILES: Input files from the pangenome analysis.

  -p PAN_DB, --pan-db PAN_DB
Anvi'o pan database
-g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
Anvi'o genomes storage file


CATEGORY VARIABLE AND FUNCTIONAL ANNOTATION SOURCE: This is the additional layer data item in which your genomes are split into multiple groups. So anvi'o can figure out what functions are specific to each group of genomes in your pangenomic analysis. If this is not making any sense, please take a look at the online tutorial for pangenomics (http://merenlab.org/2016/11/08/pangenomics-v2/).

  --category-variable CATEGORY
The additional layers data variable name that divides
layers into multiple categories.
--annotation-source SOURCE NAME
Get functional annotations for a specific annotation
source. You can use the flag '--list-annotation-
sources' to learn about what sources are available.
-l, --list-annotation-sources
List available functional annotation sources.


REPORTING: Output and stuff.

  -o FILE_PATH, --output-file FILE_PATH
File path to store results.
-F FILE, --functional-occurrence-table-output FILE
Saves the presence/absence information for functions
in genomes in a TAB-delimited format. A file name must
be provided. To learn how the functional
presence/absence is computed, please refer to the
tutorial.


  -E FLOAT, --min-function-enrichment FLOAT
Only report functions for which the min enrichment is
above the provided value. Default is 0.0.
--core-threshold FLOAT
Takes a value between 0 and 1, where 1 means that only
functions occurring in all genomes of a group would be
considered as core functions of that group. Default is
1.0.
-P FLOAT, --min-portion-occurrence-of-function-in-group FLOAT
Takes a value between 0 and 1, where 1 means that only
functions that occur in all members of one of the
compared groups will be included in the output.
Default is 0.0.
--false-detection-rate FLOAT, --FDR FLOAT
Takes a value between 0 and 1, to determine the false
detection rate that will be used for the
Benjamini–Hochberg procedure. Default is 0.1.
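The Benjamini–Hochberg procedure mentioned for `--false-detection-rate` can be illustrated with a minimal sketch. This is the textbook step-up procedure, not anvi'o's own code; the function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.1):
    """Return one boolean per p-value: True if the hypothesis is rejected.

    Textbook Benjamini-Hochberg step-up procedure (illustrative sketch,
    not taken from the anvi'o code base).
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) such that p_(k) <= (k / m) * fdr.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            largest_k = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for four functions.
example = [0.001, 0.04, 0.03, 0.5]
print(benjamini_hochberg(example, fdr=0.1))
```

A smaller value makes the thresholds stricter, so fewer functions pass.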


## anvi-get-sequences-for-gene-calls

A script to get back sequences of a list of genes

Usage

anvi-get-sequences-for-gene-calls [-h] -c CONTIGS_DB
[--gene-caller-ids GENE_CALLER_IDS]
-o FILE_PATH [--delimiter CHAR]
[--report-extended-deflines]
[--wrap WRAP] [--export-gff3]
[--get-aa-sequences]


Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
--gene-caller-ids GENE_CALLER_IDS
Gene caller ids. Multiple of them can be declared
separated by a delimiter (the default is a comma). In
anvi-gen-variability-profile, if you declare nothing
you will get all genes matching your other filtering
criteria. In other programs, you may get everything,
nothing, or an error. It really depends on the
situation. Fortunately mistakes are cheap, so it's
worth a try.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
--delimiter CHAR      The delimiter to parse multiple input terms. The
default is ','.
--report-extended-deflines
When declared, the deflines in the resulting FASTA
--wrap WRAP           When to wrap sequences when storing them in a FASTA
file. The default is '120'. A value of '0' would be
equivalent to 'do not wrap'.
--export-gff3         If this is true, the output file will be in GFF3
format.
--get-aa-sequences    Store amino acid sequences instead.
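The `--wrap` behavior described above (a default width of 120, with 0 meaning "do not wrap") can be sketched as follows. The helper name `wrap_sequence` is hypothetical, and this only illustrates the documented semantics rather than the program's implementation.

```python
def wrap_sequence(sequence, wrap=120):
    """Break a sequence into lines of at most `wrap` characters.

    A value of 0 returns the sequence unchanged, mirroring the
    documented meaning of '--wrap 0'.
    """
    if wrap < 0:
        raise ValueError("wrap must be 0 or a positive integer")
    if wrap == 0:
        return sequence
    chunks = [sequence[i:i + wrap] for i in range(0, len(sequence), wrap)]
    return "\n".join(chunks)


print(wrap_sequence("ACGTACGTAC", wrap=4))
```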


## anvi-get-sequences-for-gene-clusters

Do cool stuff with gene clusters in anvi'o pan genomes

Usage

anvi-get-sequences-for-gene-clusters [-h] -p PAN_DB
[-g GENOMES_STORAGE]
[-o FILE_PATH]
[--report-DNA-sequences]
[--gene-cluster-id GENE_CLUSTER_ID]
[--gene-cluster-ids-file FILE_PATH]
[-C COLLECTION_NAME] [-b BIN_NAME]
[--min-num-genomes-gene-cluster-occurs INTEGER]
[--max-num-genomes-gene-cluster-occurs INTEGER]
[--min-num-genes-from-each-genome INTEGER]
[--max-num-genes-from-each-genome INTEGER]
[--max-num-gene-clusters-missing-from-genome INTEGER]
[--min-functional-homogeneity-index FLOAT]
[--max-functional-homogeneity-index FLOAT]
[--min-geometric-homogeneity-index FLOAT]
[--max-geometric-homogeneity-index FLOAT]
[--list-collections] [--list-bins]
[--concatenate-gene-clusters]
[--separator STRING]
[--align-with ALIGNER]
[--list-aligners] [--just-do-it]


Parameters

INPUT FILES: Input files from the pangenome analysis.

  -p PAN_DB, --pan-db PAN_DB
Anvi'o pan database
-g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
Anvi'o genomes storage file


OUTPUT: You get to choose an output file name to report things. The default will be an ugly name. So, be explicit.

  -o FILE_PATH, --output-file FILE_PATH
File path to store results.
--report-DNA-sequences
By default, this program reports amino acid sequences.
You can change that behavior and ask for DNA sequences
instead with this flag.


SELECTION: Which gene clusters should be reported. You can ask for a single gene cluster, or multiple ones listed in a file, or you can use a collection and bin name to list gene clusters of interest. If you give nothing, this program will export alignments for every single gene cluster found in the pan database (and this is called 'customer service').

  --gene-cluster-id GENE_CLUSTER_ID
Gene cluster ID you are interested in.
--gene-cluster-ids-file FILE_PATH
Text file for gene clusters (each line should contain
a unique gene cluster id).
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.


ADVANCED FILTERS: If you are here you must be looking for ways to specify exactly what you want from that database of gene clusters. These filters will be applied to what your previous selections reported.

  --min-num-genomes-gene-cluster-occurs INTEGER
This filter will remove gene clusters from your
report. Let's assume you have 100 genomes in your pan
genome analysis. You can use this parameter if you
want to work only with gene clusters that occur in at
least X number of genomes. If you say '--min-num-
genomes-gene-cluster-occurs 90', each gene cluster in
the analysis will be required to appear in at least 90
genomes. If a gene cluster occurs in fewer than that
number of genomes, it simply will not be reported. This is
especially useful for phylogenomic analyses, where you
may want to only focus on gene clusters that are
prevalent across the set of genomes you wish to
analyze.
--max-num-genomes-gene-cluster-occurs INTEGER
This filter will remove gene clusters from your
report. Let's assume you have 100 genomes in your pan
genome analysis. You can use this parameter if you
want to work only with gene clusters that occur in at
most X number of genomes. If you say '--max-num-
genomes-gene-cluster-occurs 1', you will get gene
clusters that are singletons. Combining this parameter
with --min-num-genomes-gene-cluster-occurs can give
you a very precise way to filter your gene clusters.
--min-num-genes-from-each-genome INTEGER
This filter will remove gene clusters from your
report. If you say '--min-num-genes-from-each-genome
2', this filter will remove every gene cluster to
which every genome in your analysis contributed fewer
than 2 genes. This can be useful to find gene
clusters with many genes from many genomes (such as
conserved multi-copy genes within a clade).
--max-num-genes-from-each-genome INTEGER
This filter will remove gene clusters from your
report. If you say '--max-num-genes-from-each-genome
1', every gene cluster that has more than one gene
from any genome that contributes to it will be removed
from your analysis. This could be useful to remove
gene clusters with paralogs from your report for
appropriate phylogenomic analyses. For instance, using
'--max-num-genes-from-each-genome 1' and
'--min-num-genomes-gene-cluster-occurs X', where X is
the total number of your genomes, would give you the
single-copy gene clusters in your pan genome.
--max-num-gene-clusters-missing-from-genome INTEGER
This filter will remove genomes from your report. If
you have a list of gene cluster names, you can use
this parameter to omit any genome from your report if
it is missing more than a number of genes you desire.
For instance, if you have 100 genomes in your pan
genome, and you are interested in 5 specific gene
clusters of your choice, you can use
'--max-num-gene-clusters-missing-from-genome 4' to
remove the genomes that are missing more than 4 of
those 5 gene clusters. This is especially useful for
phylogenomic analyses. A value of 0 will remove any
genome that is missing any of the gene clusters.
--min-functional-homogeneity-index FLOAT
This filter will remove gene clusters from your
report. If you say '--min-functional-homogeneity-index
0.3', every gene cluster with a functional homogeneity
index less than 0.3 will be removed from your
analysis. This can be useful if you only want to look
at gene clusters that are highly conserved in
resulting function.
--max-functional-homogeneity-index FLOAT
This filter will remove gene clusters from your
report. If you say '--max-functional-homogeneity-index
0.5', every gene cluster with a functional homogeneity
index greater than 0.5 will be removed from your
analysis. This can be useful if you only want to look
at gene clusters that don't seem to be functionally
conserved.
--min-geometric-homogeneity-index FLOAT
This filter will remove gene clusters from your
report. If you say '--min-geometric-homogeneity-index
0.3', every gene cluster with a geometric homogeneity
index less than 0.3 will be removed from your
analysis. This can be useful if you only want to look
at gene clusters that are highly conserved in
geometric configuration.
--max-geometric-homogeneity-index FLOAT
This filter will remove gene clusters from your
report. If you say '--max-geometric-homogeneity-index
0.5', every gene cluster with a geometric homogeneity
index greater than 0.5 will be removed from your
analysis. This can be useful if you only want to look
at gene clusters that may not be as conserved as
others.
If you use any of the filters, and would like to add
the resulting item names into the items additional
data table of your database, you can use this
parameter. You will need to give a name for these
results to be saved. If the given name is already in
the items additional data table, its contents will be
replaced with the new one. Then you can run anvi-
interactive or anvi-display-pan to 'see' the results
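The occurrence filters above can be pictured with a small sketch. It assumes a hypothetical in-memory layout in which each gene cluster maps genome names to the number of genes that genome contributed; the function name and data are illustrative only and do not reflect how anvi'o stores or filters gene clusters internally.

```python
def filter_gene_clusters(clusters, min_num_genomes=None, max_num_genomes=None,
                         max_genes_per_genome=None):
    """Keep gene clusters that pass the requested occurrence filters.

    clusters: dict of gene cluster name -> {genome name: gene count}
    (a hypothetical layout chosen for this example).
    """
    kept = {}
    for name, counts in clusters.items():
        # Genomes that contribute at least one gene to this cluster.
        present = {g: n for g, n in counts.items() if n > 0}
        if min_num_genomes is not None and len(present) < min_num_genomes:
            continue
        if max_num_genomes is not None and len(present) > max_num_genomes:
            continue
        if max_genes_per_genome is not None and any(
                n > max_genes_per_genome for n in present.values()):
            continue
        kept[name] = counts
    return kept


# Three hypothetical genomes: A, B, and C.
clusters = {
    "GC_1": {"A": 1, "B": 1, "C": 1},  # single-copy, present everywhere
    "GC_2": {"A": 2, "B": 1, "C": 1},  # has a paralog in genome A
    "GC_3": {"A": 1},                  # singleton
}

# Single-copy gene clusters found in all 3 genomes.
print(sorted(filter_gene_clusters(clusters, min_num_genomes=3,
                                  max_genes_per_genome=1)))
# Singletons.
print(sorted(filter_gene_clusters(clusters, max_num_genomes=1)))
```

Combining a minimum genome count equal to the total number of genomes with a maximum of one gene per genome corresponds to the single-copy case described for '--max-num-genes-from-each-genome'.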


OTHER STUFF: Yes. Stuff that are not like the ones above.

  --list-collections    Show available collections and exit.
--list-bins           List available bins in a collection and exit.


CONCATENATED OUTPUT: Concatenated output for phylogenomics.

  --concatenate-gene-clusters
Concatenate output gene clusters in the same order to
create a multi-gene alignment output that is suitable
for phylogenomic analyses.
--separator STRING    Characters to separate things (the default is whatever
is most suitable).
--align-with ALIGNER  The multiple sequence alignment program to use when
multiple sequence alignment is necessary. To see all
available options, use the flag --list-aligners.
--list-aligners       Show available software for multiple sequence
alignment.


TOTALLY IRRELEVANT: Just in case you need it.

  --just-do-it          Don't bother me with questions or warnings, just do
it.


## anvi-get-sequences-for-hmm-hits

Get sequences for HMM hits from many inputs.

Usage

anvi-get-sequences-for-hmm-hits [-h] [-c CONTIGS_DB] [-p PROFILE_DB]
[-C COLLECTION_NAME] [-b BIN_NAME]
[-B FILE_PATH] [-e FILE_PATH]
[-i FILE_PATH]
[--hmm-sources SOURCE NAME]
[--gene-names HMM HIT NAME] [-l] [-L]
[-o FILE_PATH] [--get-aa-sequences]
[--concatenate-genes]
[--max-num-genes-missing-from-bin INTEGER]
[--min-num-bins-gene-occurs INTEGER]
[--align-with ALIGNER]
[--separator STRING]
[--return-best-hit]


Parameters

INPUT OPTION #1: CONTIGS DB: There are multiple ways to access sequences. Your first option is to provide a contigs database, and call it a day. In this case the program will return everything from it.

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'


INPUT OPTION #2: CONTIGS DB + PROFILE DB: You can also work with anvi'o profile databases and collections stored in them. If you go this way, you will still need to provide a contigs database. If you just specify a collection name, you will get hits from every bin in it. You can also use the bin name or bin ids file parameters to specify your interest more precisely.

  -p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.
-B FILE_PATH, --bin-ids-file FILE_PATH
Text file for bins (each line should be a unique bin
id).


INPUT OPTION #3: EXTERNAL GENOMES FILE: If you have multiple contigs databases without any profile database, you can start with this one. In this case you are not supposed to provide a profile database or an individual contigs database. This is for people who want to use this just with a bunch of FASTA files with their genomes.

  -e FILE_PATH, --external-genomes FILE_PATH
A two-column TAB-delimited flat text file that lists
anvi'o contigs databases. The first item in the header
line should read 'name', and the second should read
'contigs_db_path'. Each line in the file should
describe a single entry, where the first column is the
name of the genome (or MAG), and the second column is
the anvi'o contigs database generated for this genome.
-i FILE_PATH, --internal-genomes FILE_PATH
A five-column TAB-delimited flat text file. The header
line must contain these columns: 'name', 'bin_id',
'collection_id', 'profile_db_path', 'contigs_db_path'.
Each line should list a single entry, where 'name' can
be any name to describe the anvi'o bin identified as
'bin_id' that is stored in a collection.
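A minimal sketch of producing the two-column external genomes file described above, assuming the header columns are `name` and `contigs_db_path`. The function name, genome names, and database paths are hypothetical.

```python
import csv


def write_external_genomes(path, genomes):
    """Write a two-column TAB-delimited external genomes file.

    genomes: dict of genome (or MAG) name -> path to its contigs database.
    The header names used here are an assumption based on the option text.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["name", "contigs_db_path"])
        for name, db_path in genomes.items():
            writer.writerow([name, db_path])


# Hypothetical genome names and contigs database paths.
write_external_genomes("external-genomes.txt", {
    "Genome_01": "Genome_01-CONTIGS.db",
    "Genome_02": "Genome_02-CONTIGS.db",
})
```

The resulting file is what you would pass to `-e` / `--external-genomes`.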


HMM STUFF: This is where you can specify an HMM source, and/or a list of genes to filter your results.

  --hmm-sources SOURCE NAME
Get sequences for a specific list of HMM sources. You
can list one or more sources by separating them from
each other with a comma character (e.g.,
'--hmm-sources source_1,source_2,source_3'). If you would
like to see a list of available sources in the contigs
database, run this program with the '--list-hmm-sources'
flag.
--gene-names HMM HIT NAME
Get sequences only for a specific gene name. Each name
should be separated from each other by a comma
character. For instance, if you want to get back only
RecA and Ribosomal_L27, you can type '--gene-names
RecA,Ribosomal_L27', and you will get any and every
hit that matches these names in any source. If you
would like to see a list of available gene names, you
can use the '--list-available-gene-names' flag.
-l, --list-hmm-sources
List available HMM sources in the contigs database and
quit.
-L, --list-available-gene-names
List available gene names in HMM sources selection and
quit.


THE OUTPUT: Where should the output go. It will be a FASTA file, and you better give it a nice name.

  -o FILE_PATH, --output-file FILE_PATH
File path to store results.


THE ALPHABET: The sequences are reported in DNA alphabet, but you can also get them translated just like all the other cool kids.

  --get-aa-sequences    Store amino acid sequences instead.


PHYLOGENOMICS? K!: If you want, you can get your sequences concatenated. In this case anvi'o will use muscle to align every homolog, and concatenate them in the order you specified using the gene-names argument. Each concatenated sequence will be separated from the other ones by the separator.

  --concatenate-genes   Concatenate output genes in the same order to create a
multi-gene alignment output that is suitable for
phylogenomic analyses.
--max-num-genes-missing-from-bin INTEGER
This filter removes bins (or genomes) from your
analysis. If you have a list of gene names, you can
use this parameter to omit any bin (or external
genome) that is missing more than a number of genes
you desire. For instance, if you have 100 genome bins,
and you are interested in working with 5 ribosomal
proteins, you can use
'--max-num-genes-missing-from-bin 4' to remove the bins
that are missing more than 4 of those 5 genes. This is
especially useful for phylogenomic analyses. A value of
0 will remove any bin that is missing any of the genes.
--min-num-bins-gene-occurs INTEGER
This filter removes genes from your analysis. Let's
assume you have 100 bins to get sequences for HMM
hits. If you want to work only with genes among all
the hits that occur in at least X number of bins, and
discard the rest of them, you can use this flag. If
you say '--min-num-bins-gene-occurs 90', each gene in
the analysis will be required to appear in at least 90
genomes. If a gene occurs in fewer than that number of
genomes, it simply will not be reported. This is
especially useful for phylogenomic analyses, where you
may want to only focus on genes that are prevalent
across the set of genomes you wish to analyze.
--align-with ALIGNER  The multiple sequence alignment program to use when
multiple sequence alignment is necessary. To see all
available options, use the flag --list-aligners.
--separator STRING    A word that will be used to separate concatenated gene
sequences from each other (IF you are using this
program with the --concatenate-genes flag). The default
is "XXX" for amino acid sequences, and "NNN" for DNA
sequences.
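To make the concatenation step concrete, here is a small sketch that joins already-aligned genes in a fixed order with a separator. The data layout is hypothetical, and padding a missing gene with gap characters is an assumption made for the example, not a statement about what anvi'o does.

```python
def concatenate_genes(alignments, gene_order, separator="XXX"):
    """Concatenate per-gene alignments into one sequence per bin.

    alignments: dict of gene name -> {bin name: aligned sequence}
    gene_order: the order in which genes are joined.
    Bins missing a gene get a run of '-' of the alignment length
    (an assumption made for this illustration).
    """
    bins = sorted({b for gene in gene_order for b in alignments[gene]})
    concatenated = {}
    for bin_name in bins:
        parts = []
        for gene in gene_order:
            sequences = alignments[gene]
            # All sequences in one alignment share the same length.
            length = len(next(iter(sequences.values())))
            parts.append(sequences.get(bin_name, "-" * length))
        concatenated[bin_name] = separator.join(parts)
    return concatenated


# Hypothetical aligned amino acid sequences for two bins.
alignments = {
    "RecA": {"bin_1": "MK-L", "bin_2": "MKAL"},
    "Ribosomal_L27": {"bin_1": "AV"},
}
print(concatenate_genes(alignments, ["RecA", "Ribosomal_L27"]))
```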


OPTIONAL: Everything is optional, but some options are more optional than others.

  --return-best-hit     A bin may contain more than one hit for a gene name in
a given HMM source. For instance, there may be
multiple RecA hits in a genome bin from Campbell et
al. With this flag, anvi'o will go through all of the gene
names that appear multiple times, and remove all but
the one with the lowest e-value. Good for whenever you
really need to get only a single copy of single-copy
core genes from a genome bin.
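The selection rule behind `--return-best-hit` (keep only the hit with the lowest e-value when a gene name appears more than once) can be sketched like this. The record fields `bin`, `source`, `gene_name`, and `e_value` are hypothetical names chosen for the example.

```python
def return_best_hits(hits):
    """Keep one hit per (bin, source, gene name): the lowest e-value.

    hits: iterable of dicts with the hypothetical keys
    'bin', 'source', 'gene_name', and 'e_value'.
    """
    best = {}
    for hit in hits:
        key = (hit["bin"], hit["source"], hit["gene_name"])
        if key not in best or hit["e_value"] < best[key]["e_value"]:
            best[key] = hit
    return list(best.values())


# Two RecA hits in the same hypothetical bin; the stronger one is kept.
hits = [
    {"bin": "bin_1", "source": "Campbell_et_al", "gene_name": "RecA", "e_value": 1e-20},
    {"bin": "bin_1", "source": "Campbell_et_al", "gene_name": "RecA", "e_value": 1e-50},
    {"bin": "bin_1", "source": "Campbell_et_al", "gene_name": "Ribosomal_L27", "e_value": 1e-10},
]
for hit in return_best_hits(hits):
    print(hit["gene_name"], hit["e_value"])
```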


## anvi-get-short-reads-from-bam

Get short reads back from a BAM file.

Usage

anvi-get-short-reads-from-bam [-h] -p PROFILE_DB -c CONTIGS_DB
[-C COLLECTION_NAME] [-b BIN_NAME]
[-B FILE_PATH] [-o FILE_PATH]
[-O FILENAME_PREFIX] [-X] [-Q]
BAM FILE[S] [BAM FILE[S] ...]


Parameters

positional arguments:

  BAM FILE[S]           BAM file(s) to access to recover short reads


optional arguments:

  -p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.
-B FILE_PATH, --bin-ids-file FILE_PATH
Text file for bins (each line should be a unique bin
id).
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
-O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
A prefix to be used while naming the output files (no
file type extensions please; just a prefix).
-X, --gzip-output     When declared, output file(s) will be gzip compressed
and the extension .gz will be added.
-Q, --split-R1-and-R2
When declared, this program outputs 3 FASTA files for
paired-end reads: one for R1, one for R2, and one for


## anvi-get-short-reads-mapping-to-a-gene

Access reads in contigs and positions in a BAM file

Usage

anvi-get-short-reads-mapping-to-a-gene [-h] -i INPUT_BAM(S)
[INPUT_BAM(S) ...] -c CONTIGS_DB
--gene-caller-id GENE_CALLER_ID
-o FILE_PATH
[--leeway LEEWAY_NTs]


Parameters

optional arguments:

  -i INPUT_BAM(S) [INPUT_BAM(S) ...], --input-files INPUT_BAM(S) [INPUT_BAM(S) ...]
Sorted and indexed BAM files to analyze. It is
essential that all BAM files must be the result of
mappings against the same contigs.
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
--gene-caller-id GENE_CALLER_ID
A single gene id.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
--leeway LEEWAY_NTs   The minimum number of nucleotides for a given short
read mapping into the gene context for it to be
reported. You must consider the length of your short
reads, as well as the length of the gene you are
targeting. The default is 100 nts.


## anvi-get-split-coverages

Export splits and the coverage table from database

Usage

anvi-get-split-coverages [-h] -p PROFILE_DB [--split-name SPLIT_NAME]
[-c CONTIGS_DB] [-C COLLECTION_NAME]
[-b BIN_NAME] [-o FILE_PATH] [--list-splits]
[--list-collections] [--list-bins]


Parameters

ESSENTIAL ANVI'O DB: You need to provide a profile database.

  -p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database


INPUT OPTION #1: SPLIT NAME: You want nothing but the coverage values in a single split. FINE.

  --split-name SPLIT_NAME
Split name.


INPUT OPTION #2: COLLECTION + BIN: You want nucleotide-level coverage values for all splits in a bin. FANCY.

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.


BORING STUFF: The output file and all.

  -o FILE_PATH, --output-file FILE_PATH
File path to store results.
--list-splits         When declared, the program will list split names in
the profile database and quit.
--list-collections    Show available collections and exit.
--list-bins           List available bins in a collection and exit.


## anvi-import-collection

Import an external binning result into anvi'o

Usage

anvi-import-collection [-h] [-c CONTIGS_DB] [-p PAN_OR_PROFILE_DB] -C
COLLECTION_NAME [--bins-info BINS_INFO]
[--contigs-mode]
TAB DELIMITED FILE


Parameters

positional arguments:

  TAB DELIMITED FILE    The input file that describes bin IDs for each split
or contig.


optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
--bins-info BINS_INFO
Additional information for bins. The file must contain
three TAB-delimited columns, where the first one must
be a unique bin name, the second should be a 'source',
and the last one should be a 7 character HTML color
code (e.g., '#424242'). Source column must contain
information about the origin of the bin. If these bins
are automatically identified by a program like
CONCOCT, this column could contain the program name
and version. The source information will be associated
with the bin in various interfaces so in a sense it is
not *that* critical what it says there, but on the
other hand it is, because we should also think about
people who may end up having to work with what we put
together later.
--contigs-mode        Use this flag if your binning was done on contigs
for help.
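A quick way to sanity-check a bins info file before importing it is sketched below. It follows the three-column description for `--bins-info` (unique bin name, source, 7 character HTML color code) and assumes the file has no header line, which the option text does not state. The function name and example rows are hypothetical.

```python
import re

# A 7 character HTML color code: '#' followed by six hexadecimal digits.
HTML_COLOR = re.compile(r"^#[0-9A-Fa-f]{6}$")


def validate_bins_info(lines):
    """Return a list of problems found in bins info lines.

    Each line is expected to hold three TAB-delimited fields:
    bin name, source, and HTML color code. Assumes no header line.
    """
    seen = set()
    errors = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            errors.append(f"line {number}: expected 3 columns, found {len(fields)}")
            continue
        bin_name, source, color = fields
        if bin_name in seen:
            errors.append(f"line {number}: duplicate bin name '{bin_name}'")
        seen.add(bin_name)
        if not HTML_COLOR.match(color):
            errors.append(f"line {number}: '{color}' is not a 7 character HTML color code")
    return errors


# Hypothetical rows: the second one has a malformed color.
rows = ["Bin_1\tCONCOCT\t#424242", "Bin_2\tCONCOCT\t424242"]
print(validate_bins_info(rows))
```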


## anvi-import-functions

Parse and store functional annotation of genes.

Usage

anvi-import-functions [-h] -c CONTIGS_DB [-p PARSER] -i FILE(S)
[FILE(S) ...] [--drop-previous-annotations]


Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-p PARSER, --parser PARSER
Parser to make sense of the input files (if you need
one). There is currently 1 parser readily available:
['interproscan']. IT IS OK if you do not select a
parser if you have a standard, TAB-delimited input
file for functional annotation of genes. If this is
not like 2018 and everything is already outdated, you
should be able to go to this address and learn
everything you need like a boss:
http://merenlab.org/2016/06/18/importing-functions/
-i FILE(S) [FILE(S) ...], --input-files FILE(S) [FILE(S) ...]
One or more input files should follow this parameter.
The way these files will be handled will depend on
which parser you selected (if you did select any).
--drop-previous-annotations
Use this flag if you want anvi'o to remove ALL
previous functional annotations for your genes, and
then import the new data. The default behavior will
add any annotation source into the db incrementally
unless there are already annotations from this source.
In which case, it will first remove previous
annotations for that source only (i.e., if source X is
both in the db and in the incoming annotations data,
it will replace the content of source X in the db).
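The difference between the default incremental import and `--drop-previous-annotations` can be pictured with plain dictionaries. This is only a model of the behavior described above; the function name and the source-to-annotations layout are hypothetical.

```python
def import_functions(existing, incoming, drop_previous=False):
    """Model how annotation sources are merged on import.

    existing, incoming: dict of source name -> {gene caller id: function}.
    By default, sources in `incoming` replace same-named sources and all
    other sources are kept. With drop_previous=True, everything that was
    there before is removed first.
    """
    updated = {} if drop_previous else dict(existing)
    updated.update(incoming)
    return updated


# Hypothetical annotation sources X and Y.
existing = {"X": {1: "old function"}, "Y": {2: "kept function"}}
incoming = {"X": {1: "new function"}}

print(import_functions(existing, incoming))
print(import_functions(existing, incoming, drop_previous=True))
```

In the default case source Y survives and source X is replaced; with the flag, only the incoming source X remains.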


## anvi-import-misc-data

Populate additional data or order tables in pan or profile databases for items or layers (the Swiss army knife-level serious stuff).

Usage

anvi-import-misc-data [-h] -p PAN_OR_PROFILE_DB -t NAME [-D NAME]
[--transpose] [--just-do-it]
TAB DELIMITED FILE


Parameters

positional arguments:

  TAB DELIMITED FILE    The input file that describes an additional data for
layers or items. The expected format of this file
depends on the data table you will target. This can
feel complicated, but we promise it is not (you
probably have a PhD or are working on one, so trust us
when we say "it is not complicated"). You need to read
the online documentation if this is your first time
with this.


optional arguments:

  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database
-t NAME, --target-data-table NAME
The target table is the table you are interested in
accessing. Currently it can be 'items', 'layers', or
'layer_orders'. Please see most up-to-date online
-D NAME, --target-data-group NAME
Data group to focus. Anvi'o misc data tables support
associating a set of data keys with a data group. If
you have no idea what this is, then probably you don't
need it, and anvi'o will take care of you. Note: this
flag is IRRELEVANT if you are working with additional
order data tables.
--transpose           Transpose the input matrix file before clustering.
--just-do-it          Don't bother me with questions or warnings, just do
it.


## anvi-import-state

Import an anvi'o state into a profile database.

Usage

anvi-import-state [-h] -p PAN_OR_PROFILE_DB -s STATE_FILE -n STATE_NAME


Parameters

optional arguments:

  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database
-s STATE_FILE, --state STATE_FILE
JSON serializable anvi'o state file.
-n STATE_NAME, --name STATE_NAME
State name.


## anvi-import-taxonomy-for-genes

Import gene-level taxonomy into an anvi'o contigs database.

Usage

anvi-import-taxonomy-for-genes [-h] -c CONTIGS_DB [-p PARSER] -i FILE(S)
[FILE(S) ...] [--just-do-it]


Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-p PARSER, --parser PARSER
Parser to make sense of the input files. There are 3
'centrifuge', 'kaiju']. It is OK if you do not select
a parser, but in that case there will be no additional
contigs available except the identification of single-
copy genes in your contigs for later use. Using a
parser will not prevent the analysis of single-copy
or get in touch with the developers if you have any
questions regarding parsers.
-i FILE(S) [FILE(S) ...], --input-files FILE(S) [FILE(S) ...]
Input file(s) for selected parser. Each parser (except
"blank") requires input files to process that you
generate before running anvi'o. Please see the
documentation for details.
--just-do-it          Don't bother me with questions or warnings, just do
it.


## anvi-import-taxonomy-for-layers

Import layers-level taxonomy into an anvi'o additional layer data table in an anvi'o single-profile database.

Usage

anvi-import-taxonomy-for-layers [-h] -p PROFILE_DB [--parser PARSER] -i
FILE(S) [FILE(S) ...]
[--min-abundance PERCENTAGE]


Parameters

optional arguments:

  -p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
--parser PARSER       Parser to make sense of the input files. There are 1
-i FILE(S) [FILE(S) ...], --input-files FILE(S) [FILE(S) ...]
Input file(s) for selected parser. Each parser (except
"blank") requires input files to process that you
generate before running anvi'o. Please see the
documentation for details.
--min-abundance PERCENTAGE
Short read-based taxonomy can be extremely noisy.
Therefore, here we have a default minimum percentage
cutoff of 0.1% to eliminate any taxon that occurs less
than that in a given input file.
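The effect of `--min-abundance` can be sketched as a simple percentage cutoff. The sketch assumes that abundance is measured as a percentage of all counts in one input file, which is a reading of the option text rather than a confirmed detail; the function name and taxon names are hypothetical.

```python
def filter_taxa(counts, min_abundance=0.1):
    """Drop taxa whose share of the file's total is below the cutoff.

    counts: dict of taxon name -> count in a single input file.
    min_abundance: cutoff as a percentage (0.1 means 0.1%).
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: count for taxon, count in counts.items()
            if 100.0 * count / total >= min_abundance}


# Hypothetical counts: taxon_C makes up 0.01% of the file and is removed.
counts = {"taxon_A": 9979, "taxon_B": 20, "taxon_C": 1}
print(filter_taxa(counts))
```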


## anvi-init-bam

Sort/Index BAM files

Usage

anvi-init-bam [-h] [-o FILE_PATH] BAM_FILE


Parameters

positional arguments:

  BAM_FILE              BAM file to analyze


optional arguments:

  -o FILE_PATH, --output-file FILE_PATH
File path to store results.


## anvi-interactive

Start an anvi'o server for the interactive interface

Usage

anvi-interactive [-h] [-p PROFILE_DB] [-c CONTIGS_DB]
[-C COLLECTION_NAME] [--manual-mode] [-f FASTA]
[-d VIEW_DATA] [-t NEWICK] [--items-order FLAT_FILE]
[--view NAME] [--title NAME]
[--taxonomic-level {t_phylum,t_class,t_order,t_family,t_genus,t_species}]
[--split-hmm-layers] [--hide-outlier-SNVs]
[--export-svg FILE_PATH] [--gene-mode] [-b BIN_NAME]
[--show-views] [--skip-check-names] [-o DIR_PATH]
[--dry-run] [--show-states] [--list-collections]
[--skip-init-functions] [--skip-auto-ordering]
[--distance DISTANCE_METRIC]


Parameters

DEFAULT INPUTS: The interactive interface can be started with and without anvi'o databases. The default use assumes you have your profile and contigs databases; however, it is also possible to start the interface using ad hoc input files. See the 'MANUAL INPUTS' section for required parameters.

  -p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
If you have a collection in your profile database, you
can use this flag to start the interactive interface
instead of each split. This is very useful when you
have imported your external binning results into
anvi'o, and want to see the distribution of your bins
across samples. In these cases anvi'o will cluster
your bins based on multiple metrics. Because this
particular clustering will be done on the fly within
anvi'o interactive class, you get to define a
and --distance parameters if you want!


MANUAL INPUTS: Mandatory input parameters to start the interactive interface without anvi'o databases.

  --manual-mode         Using this flag, you can run the interactive interface
in an ad hoc manner using input files you curated
instead of standard output files generated by an
anvi'o run. In the manual mode you will be asked to
provide a profile database. In this mode a profile
database is only used to store 'state' of the
settings when you re-analyze the same files again. If
the profile database you provide does not exist,
anvi'o will create an empty one for you.
-f FASTA, --fasta-file FASTA
A FASTA-formatted input file
-d VIEW_DATA, --view-data VIEW_DATA
A TAB-delimited file for view data
-t NEWICK, --tree NEWICK
NEWICK formatted tree structure
--items-order FLAT_FILE
A flat file that contains the order of items you wish
to display using the interactive interface. You may
want to use this if you have a specific order of items
in your mind, and do not want to display a tree in the
middle (or simply you don't have one). The file format
is simple: each line should have an item name, and


  -V ADDITIONAL_VIEW, --additional-view ADDITIONAL_VIEW
A TAB-delimited file for an additional view to be used
in the interface. This file should contain all
split names, and values for each of them in all
samples. Each column in this file must correspond to a
sample name. Content of this file will be called
'user_view', which will be available as a new item in
the 'views' combo box in the interface
A TAB-delimited file for additional layers for splits.
The first column of this file must be split names, and
the remaining columns should be unique attributes. The
file does not need to contain all split names, or
values for each split in every column. Anvi'o will try
to deal with missing data nicely. Each column in this
file will be visualized as a new layer in the tree.


GENE MODE: Gene mode related parameters.

  --gene-mode           Initiate the interactive interface in "gene mode". In
this mode, the items are genes (instead of splits of
contigs). The following views are available: detection
(the detection value of each gene in each sample). The
mean_coverage (the mean coverage of genes). The
non_outlier_mean_coverage (the mean coverage of the
non-outlier nucleotide positions of each gene in each
sample (median absolute deviation is used to remove
outliers per gene per sample)). The
the coverage of non-outlier positions of genes in
samples). You can also choose to order items and
layers according to each one of the aforementioned
views. In addition, all layer orderings that are
available in the regular mode (i.e. the full mode
where you have contigs/splits) are also available in
"gene mode", so that, for example, you can choose to
order the layers according to "detection", and that
would be the order according to the detection values
of splits, whereas if you choose "genes_detections"
then the order of layers would be according to the
detection values of genes. Inspection and sequence
functionality are available (through the right-click
menu), except now sequences are of the specific gene.
Inspection has now two options available: "Inspect
Context", which brings you to the inspection page of
the split to which the gene belongs where the
inspected gene will be highlighted in yellow in the
bottom, and "Inspect Gene", which opens the inspection
page only for the gene and 100 nts around each side of
it (the purpose of this option is to make the
inspection page load faster if you only want to look
at the nucleotide coverage of a specific gene).
NOTICE: You can't store states or collections in "gene
mode". However, you still can make fake selections,
and create fake bins for your viewing convenience only
(smiley). Search options are available, and you can
even search for functions if you have them in your
might take a while if your bin has many genes, and
your profile database has many samples, this is
because the gene coverage stats are computed in an
not ideal and we plan to improve that (along with
other things). If you have suggestions/complaints
regarding this mode please comment on this github
issue: https://goo.gl/yHhRei. Please refer to the
-b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.


  --view NAME           Start the interface with a pre-selected view. To see a
list of available views, use --show-views flag.
--title NAME          Title for the interface. If you are working with a
RUNINFO dict, the title will be determined based on
information stored in that file. Regardless, you can
override that value using this parameter. If you are
not using an anvi'o RUNINFO dictionary, a meaningful
title will appear in the interface only if you define
one using this parameter.
--taxonomic-level {t_phylum,t_class,t_order,t_family,t_genus,t_species}
The taxonomic level to use. The default is 't_genus'.
Only relevant if the anvi'o contigs database contains
taxonomic annotations.
--split-hmm-layers    When declared, this flag tells the interface to split
every gene found in HMM searches that were performed
against non-singlecopy gene HMM profiles into their
own layer. Please see the documentation for details.
--hide-outlier-SNVs   During profiling, anvi'o marks positions of single-
nucleotide variations (SNVs) that originate from
places in contigs where coverage values are a bit
'sketchy'. If you would like to avoid SNVs in those
positions of splits in applicable projects you can use
this flag, and the interface would hide SNVs that are
marked as 'outlier' (although it is clearly the best
to see everything, no one will judge you if you end up
using this flag) (plus, there may or may not be some
historical data on this here:
https://github.com/meren/anvio/issues/309).
Automatically load previous saved state and draw tree.
To see a list of available states, use --show-states
flag.
Automatically load a collection and draw tree. To see
a list of available collections, use --list-
collections flag.
--export-svg FILE_PATH
The SVG output file path.


SWEET PARAMS OF CONVENIENCE: Parameters and flags that are not quite essential (but nice to have).

  --show-views          When declared, the program will show a list of
available views, and exit.
--skip-check-names    For debugging purposes. You should never really need
it.
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files
--dry-run             Don't do anything real. Test everything, and stop
right before wherever the developer said 'well, this
is enough testing', and decided to print out results.
--show-states         When declared the program will print all available
states and exit.
--list-collections    Show available collections and exit.
--skip-init-functions
When declared, function calls for genes will not be
initialized (therefore will be missing from all
relevant interfaces or output files). The use of this
flag may reduce the memory footprint and processing
time for large datasets.
--skip-auto-ordering  When declared, the attempt to include automatically
generated orders of items based on additional data is
skipped. In case those buggers cause issues with your
data, and you still want to see your stuff and deal
with the other issue maybe later.
--distance DISTANCE_METRIC
The distance metric for the hierarchical clustering.
Only relevant if you are running the interactive
interface in "collection" mode. The default is
"euclidean".
The linkage method for the hierarchical clustering.
Only relevant if you are running the interactive
interface in "collection" mode. The default is "ward".


SERVER CONFIGURATION: For power users.

  -I IP_ADDR, --ip-address IP_ADDR
(0.0.0.0) should work just fine for most.
-P INT, --port-number INT
Port number to use for anvi'o services. If nothing is
declared, anvi'o will try to find a suitable port
number, starting from the default port number, 8080.
--browser-path PATH   By default, anvi'o will use your default browser to
launch the interactive interface. If you would like to
use something else than your system default, you can
provide a full path for an alternative browser using
this parameter, and hope for the best. For instance we
are using this parameter to call Google's experimental
browser, Canary, which performs better with demanding
visualizations.
--read-only           When the interactive interface is started with this
flag, all 'database write' operations will be
disabled.
--server-only         The default behavior is to start the local server, and
fire up a browser that connects to the server. If you
have other plans, and want to start the server without
calling the browser, this is the flag you need.


## anvi-matrix-to-newick

Takes an observation matrix, returns a newick tree.

Usage

anvi-matrix-to-newick [-h] [-o FILE_PATH] [--transpose]
[--distance DISTANCE_METRIC]
PATH


Parameters

positional arguments:

  PATH                  Input matrix


optional arguments:

  -o FILE_PATH, --output-file FILE_PATH
File path to store results.
--transpose           Transpose the input matrix file before clustering.
--distance DISTANCE_METRIC
The distance metric for the hierarchical clustering.
The default distance metric is 'euclidean'. You can
find the full list of distance metrics either by
making a mistake (such as entering a non-existent
distance metric and making anvi'o upset), or by taking
a look at the help menu of the pdist function in the
scipy.spatial.distance module.
The linkage method for the hierarchical clustering.
The default linkage method is 'ward', because that is
the best one. It really is. We talked to a lot of
people and they were all like 'this is the best one
available' and it is just all out there. Honestly it
is so good that we will build a wall around it and
make other linkage methods pay for it. But if you want
to see a full list of available ones you can check the
hierarchy.linkage function in the scipy.cluster module.
Up to you really. But then you can't use ward anymore,
and you would have to leave anvi'o right now.
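
The overall idea of turning an observation matrix into a Newick tree can be sketched in plain Python. Note what is swapped in here: this sketch uses Euclidean distances with average linkage instead of ward, reports topology only (no branch lengths), and does not call scipy at all, so it is an illustration of the concept rather than what the program does. The function names and the example data are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equally long observation vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def matrix_to_newick(rows):
    """Agglomerative clustering (average linkage) of named rows.

    rows: dict mapping an item name to its list of observations.
    Returns a Newick string describing the topology only.
    """
    # Each cluster is (newick_fragment, member_names).
    clusters = [(name, [name]) for name in rows]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            members_i, members_j = clusters[i][1], clusters[j][1]
            # Average pairwise distance between the two clusters.
            dist = sum(
                euclidean(rows[a], rows[b]) for a in members_i for b in members_j
            ) / (len(members_i) * len(members_j))
            if best is None or dist < best[0]:
                best = (dist, i, j)
        _, i, j = best
        merged = (
            "(%s,%s)" % (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0] + ";"


# Hypothetical matrix: s1 and s2 are close, s3 is far away.
tree = matrix_to_newick({"s1": [1, 1], "s2": [1, 2], "s3": [10, 10]})
```

Transposing the input before clustering, as the --transpose flag does, would amount to clustering the columns of the matrix instead of its rows.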


## anvi-mcg-classifier

A program to classify genes according to coverage across multiple metagenomes

Usage

anvi-mcg-classifier [-h] -p PROFILE_DB -c CONTIGS_DB
[-O FILENAME_PREFIX] [-C COLLECTION_NAME]
[-b BIN_NAME] [-B FILE_PATH]
[--exclude-samples FILE] [--include-samples FILE]
[--gen-figures] [-W] [--alpha NUM]
[--outliers-threshold NUM] [--zeros-are-outliers]


Parameters

ESSENTIAL INPUTS: You must supply a merged profile db (along with a matching contigs db)

  -p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'


ESSENTIAL OUTPUTS: The outputs of the algorithm are an anvi'o additional layers format file with the classification information for genes, and an anvi'o samples information file with detection information per sample. In addition, when a profile database is given, gene-coverages and gene-detection tables are also saved. All files are created with the prefix that is provided by the user.

  -O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
A prefix to be used while naming the output files (no
file type extensions please; just a prefix).
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.


ADDITIONAL STUFF: Parameters to provide pre-existing additional layers, samples-information files, so that the outputs would be added to these files

  -b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.
-B FILE_PATH, --bin-ids-file FILE_PATH
Text file for bins (each line should be a unique bin
id).
--exclude-samples FILE
List of samples to exclude for the analysis.
--include-samples FILE
List of samples to include for the analysis.
--gen-figures         For those of you who wish to dig deeper, a collection
of figures could be created to allow you to get
insight into how the classification was generated.
This is especially useful to identify cases in which
you shouldn't trust the classification (for example
due to a large number of outliers). NOTICE: if you ask
anvi'o to generate these figures then it will
significantly extend the execution time. To learn
about which figures are created and what they mean,
contact your nearest anvi'o developer, because
currently it is a well-hidden secret.
-W, --overwrite-output-destinations
Overwrite if the output files and/or directories
exist.


PARAMETERS: Parameters to determine cut-offs for the gene-classifier

  --alpha NUM, --genome-detection-uncertainty NUM
Determines the range of sample detection values that
are considered negative, ambiguous or positive. Min of
0 and smaller than 0.5, default of 0.25. For example,
with the default, samples with detection below 0.5 - 0.25
= 0.25 will be considered negative (i.e. do not contain
the genome), samples with detection between 0.25 and
0.75 would be ambiguous (and hence would not be used
for the classification), and samples with detection
above 0.75 would be considered positive (i.e. contain
the genome).
--outliers-threshold NUM
Threshold to use for the outlier detection. The
default value is 2.5. Absolute deviation around the
refer to: Boris Iglewicz and David Hoaglin (1993),
"Volume 16: How to Detect and Handle Outliers", The
ASQC Basic References in Quality Control: Statistical
Techniques, Edward F. Mykytka, Ph.D., Editor. Or to:
http://www.sciencedirect.com/science/article/pii/S0022103113000668
--zeros-are-outliers  If you want all zero coverage positions to be treated
like outliers then use this flag. The reason to treat
zero coverage as outliers is because when mapping
reads to a reference we could get many zero positions
due to accessory genes. These positions then skew the
average values that we compute.
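
The cut-offs above can be illustrated with a short Python sketch. The first function follows the --alpha description directly; whether the boundaries themselves (exactly 0.25 or 0.75 with the default) count as ambiguous is an assumption. The second function is only one plausible reading of the outlier parameters: it assumes a median-absolute-deviation rule with the conventional 1.4826 scaling constant, which the help text does not spell out, and both function names are hypothetical.

```python
from statistics import mean, median


def classify_sample(detection, alpha=0.25):
    """Label a sample as negative, ambiguous, or positive for a genome."""
    if not 0 <= alpha < 0.5:
        raise ValueError("alpha must be at least 0 and smaller than 0.5")
    if detection < 0.5 - alpha:
        return "negative"
    if detection > 0.5 + alpha:
        return "positive"
    return "ambiguous"  # assumption: boundary values are ambiguous


def non_outlier_mean(coverages, outliers_threshold=2.5, zeros_are_outliers=False):
    """Mean coverage after dropping positions flagged as outliers.

    Assumption: a position is an outlier when its absolute deviation from
    the median, divided by the scaled MAD, exceeds the threshold.
    """
    values = list(coverages)
    med = median(values)
    scaled_mad = 1.4826 * median([abs(v - med) for v in values])
    kept = []
    for value in values:
        if zeros_are_outliers and value == 0:
            continue  # treat zero-coverage positions as outliers
        if scaled_mad > 0 and abs(value - med) / scaled_mad > outliers_threshold:
            continue
        kept.append(value)
    return mean(kept) if kept else 0.0
```

For instance, `classify_sample(0.1)` gives `"negative"` and `classify_sample(0.9)` gives `"positive"` with the default alpha, while samples in between are left out of the classification.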


## anvi-merge

Merge multiple anvi'o profiles

Usage

anvi-merge [-h] -c CONTIGS_DB [-o DIR_PATH] [-S NAME]
[--description TEXT_FILE] [--skip-hierarchical-clustering]
[--enforce-hierarchical-clustering]
[--skip-concoct-binning] [-W]
SINGLE_PROFILE(S) [SINGLE_PROFILE(S) ...]


Parameters

positional arguments:

  SINGLE_PROFILE(S)     Anvi'o single profiles to merge


optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files
-S NAME, --sample-name NAME
It is important to set a sample name (using only ASCII
letters and digits and without spaces) that is unique
(considering all others). If you do not provide one,
anvi'o will try to make up one for you based on other
information, although you should never let the
software decide these things.
--description TEXT_FILE
A plain text file that contains some description about
the project. You can use Markdown syntax. The
description text will be rendered and shown in all
relevant interfaces, including the anvi'o interactive
interface, or anvi'o summary outputs.
--skip-hierarchical-clustering
If you are not planning to use the interactive
interface (or if you have other means to add a tree of
contigs in the database) you may skip the step where
hierarchical clustering of your items is performed
based on default clustering recipes matching to your
database type.
--enforce-hierarchical-clustering
If you have more than 25,000 splits in your merged
profile, anvi-merge will automatically skip the
hierarchical clustering of splits (by setting --skip-
hierarchical-clustering flag on). This is due to the
fact that computational time required for hierarchical
clustering increases exponentially with the number of
items being clustered. Based on our experience we
decided that 25,000 splits is about the maximum we
should try. However, this is not a theoretical limit,
and you can override this heuristic by using this
flag, which would tell anvi'o to attempt to cluster
splits regardless.
--distance DISTANCE_METRIC
The distance metric for the hierarchical clustering.
If you do not use this flag, the default distance
metric will be used for each clustering configuration
which is "euclidean".
The same story with the --distance, except, the
system default for this one is ward.
--skip-concoct-binning
Anvi'o uses CONCOCT (Alneberg et al.) by default for
unsupervised genome binning for merged runs. CONCOCT
results are stored in the profile database, which then
can be used from within appropriate interfaces (i.e.,
anvi-interactive, anvi-summary, etc). Use this flag if
you would like to skip this step
-W, --overwrite-output-destinations
Overwrite if the output files and/or directories
exist.
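
The interplay between --skip-hierarchical-clustering and --enforce-hierarchical-clustering boils down to a small decision rule, sketched below in Python. The function name is hypothetical, and the assumption that an explicit skip request wins over an enforce request is ours; the help text does not say what happens when both are given.

```python
def should_cluster(num_splits, skip=False, enforce=False, max_splits=25000):
    """Decide whether hierarchical clustering of splits should be attempted."""
    if skip:
        return False  # assumption: an explicit skip always wins
    if num_splits > max_splits and not enforce:
        return False  # too many splits: clustering is skipped automatically
    return True
```

With these defaults, `should_cluster(30000)` is `False`, while `should_cluster(30000, enforce=True)` is `True`.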


## anvi-merge-bins

Merge a given set of bins in an anvi'o collection

Usage

anvi-merge-bins [-h] -p PAN_OR_PROFILE_DB [-C COLLECTION_NAME]
[-b BIN NAMES] [-B BIN NAME] [--list-collections]
[--list-bins]


Parameters

DB AND COLLECTION: Simple enough. This guy needs a pan or profile database and a collection name. You can get a list of available collections with another flag down below.

  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.


BINS TO WORK WITH: Here you need to define a list of bin names to merge, and the new bin name for them to merge under. Your bin names should be comma-separated. Both 'name_1, name_2, name_3' and name_1,name_2,name_3 will work. Your new bin name better be a single word, meaningful name so anvi'o does not complain about it later.

  -b BIN NAMES, --bin-names-list BIN NAMES
Comma-separated list of bin names.
-B BIN NAME, --new-bin-name BIN NAME
The new bin name.
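
As a rough illustration of what merging bins means, the Python sketch below treats a collection as a dictionary that maps bin names to sets of item names. That data model and the function name `merge_bins` are assumptions for the example, not how anvi'o stores collections. The parsing step shows why both 'name_1, name_2, name_3' and name_1,name_2,name_3 are acceptable.

```python
def merge_bins(collection, bin_names_list, new_bin_name):
    """Return a new collection in which the listed bins are merged into one."""
    # Split on commas and strip spaces, so both list styles are accepted.
    names = [name.strip() for name in bin_names_list.split(",") if name.strip()]
    missing = [name for name in names if name not in collection]
    if missing:
        raise KeyError("Unknown bins: %s" % ", ".join(missing))
    merged_items = set()
    for name in names:
        merged_items |= collection[name]
    # Keep every bin that was not merged, then add the new one.
    result = {b: set(items) for b, items in collection.items() if b not in names}
    result[new_bin_name] = merged_items
    return result


# Hypothetical collection with three bins.
collection = {
    "Bin_1": {"split_a", "split_b"},
    "Bin_2": {"split_c"},
    "Bin_3": {"split_d"},
}
merged = merge_bins(collection, "Bin_1, Bin_2", "Bin_merged")
```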


SWEET FLAGS OF CONVENIENCE: We gotchu.

  --list-collections    Show available collections and exit.
--list-bins           List available bins in a collection and exit.


## anvi-meta-pan-genome

Convert a pangenome into a metapangenome.

Usage

anvi-meta-pan-genome [-h] -p PAN_DB [-g GENOMES_STORAGE] [-i FILE]
[--fraction-of-median-coverage FLOAT]
[--min-detection FLOAT]


Parameters

PANGENOME: Files for the pangenome.

  -p PAN_DB, --pan-db PAN_DB
Anvi'o pan database
-g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
Anvi'o genomes storage file


METAGENOME: Genome bins stored in anvi'o profile databases as collections.

  -i FILE, --internal-genomes FILE
A four-column TAB-delimited flat text file. This file
should be identical to the internal genomes file you
used for your pangenomics analysis. Anvi'o will use
this file to find all profile and contigs databases
that contain the information for each gene and genome
across metagenomes.


CRITERION FOR DETECTION: This is tricky. What we want to do is to identify genes that are occurring uniformly across samples.

  --fraction-of-median-coverage FLOAT
The value set here will be used to remove a gene if
its total coverage across environments is less than
the median coverage of all genes multiplied by this
value. The default is 0.25, which means, if the median
total coverage of all genes across all samples is
100X, then, a gene with a total coverage of less than
25X across all samples will be assumed not a part of
the 'environmental core'.
--min-detection FLOAT
For this entire thing to work, the genome you are
focusing on should be detected in at least one
metagenome. If that is not the case, it would mean
that you do not have any sample that represents the
niche for this organism (or you do not have enough
depth of coverage) to investigate the detection of
genes in the environment. By default, this script
requires at least '0.5' of the genome to be detected
in at least one metagenome. This parameter allows you
to change that. 0 would mean no detection test
required, 1 would mean the entire genome must be
detected.
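
The two criteria can be written down as a small Python sketch. The function names and the input dictionaries are hypothetical; the arithmetic follows the descriptions above, with the assumption that a gene sitting exactly at the cutoff is kept.

```python
from statistics import median


def environmental_core(total_coverage, fraction_of_median_coverage=0.25):
    """Return genes whose total coverage reaches the median-based cutoff.

    total_coverage: dict mapping a gene to its total coverage across samples.
    """
    cutoff = median(total_coverage.values()) * fraction_of_median_coverage
    return {gene for gene, cov in total_coverage.items() if cov >= cutoff}


def genome_is_detected(detection_per_metagenome, min_detection=0.5):
    """True if the genome is detected well enough in at least one metagenome."""
    return any(d >= min_detection for d in detection_per_metagenome.values())


# Hypothetical numbers: the median total coverage is 100X, so the cutoff is 25X.
coverage = {"g1": 100, "g2": 120, "g3": 80, "g4": 20, "g5": 100}
core = environmental_core(coverage)
```

Here `g4`, with a total coverage of 20X, falls below the 25X cutoff and is left out of the environmental core.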


## anvi-migrate-db

positional arguments: DATABASE Anvi'o database for migration

Usage

anvi-migrate-db [-h] [--just-do-it] [-t VERSION]
DATABASE [DATABASE ...]


Parameters

optional arguments:

  --just-do-it          Do not bother me with warnings
-t VERSION, --target-version VERSION
reaches to this version.


## anvi-oligotype-linkmers

Takes an anvi'o linkmers report, generates an oligotyping output

Usage

anvi-oligotype-linkmers [-h] -i LINKMER_REPORT -o DIR_PATH


Parameters

optional arguments:

  -i LINKMER_REPORT, --input-file LINKMER_REPORT
Output file of anvi-report-linkmers.
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files


## anvi-pan-genome

A DIAMOND and MCL-based anvi'o pangenome workflow. You provide genomes from anywhere (whether they are external genomes, or anvi'o genome bins in collections), and it gives you back a pangenome analysis.

Usage

anvi-pan-genome [-h] -g GENOMES_STORAGE [-G GENOME_NAMES]
[--skip-alignments] [--skip-homogeneity]
[--quick-homogeneity] [--align-with ALIGNER]
[--exclude-partial-gene-calls] [--use-ncbi-blast]
[--minbit MINBIT] [--mcl-inflation INFLATION]
[--min-occurrence NUM_OCCURRENCE]
[--min-percent-identity PERCENT] [--sensitive]
[-n PROJECT_NAME] [--description TEXT_FILE]
[--skip-hierarchical-clustering]
[--enforce-hierarchical-clustering]


Parameters

GENOMES: The very fancy genomes storage file. This file is generated by the program anvi-genomes-storage. Please see the online tutorial on pangenomic workflow if you don't know how to generate one.

  -g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
Anvi'o genomes storage file
-G GENOME_NAMES, --genome-names GENOME_NAMES
Genome names to 'focus'. You can use this parameter to
limit the genomes included in your analysis. You can
provide these names as a comma-separated list of
names, or you can put them in a file, where you have a
single genome name in each line, and provide the file
path.
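
A parameter that accepts either a comma-separated list or a file path can be handled along the lines of the Python sketch below. The function name is hypothetical, and checking whether the value is an existing path before falling back to a list is an assumption about the order of the checks.

```python
import os


def parse_genome_names(value):
    """Return genome names from a file path or a comma-separated string."""
    if os.path.exists(value):
        # One genome name per line; blank lines are ignored.
        with open(value) as handle:
            return [line.strip() for line in handle if line.strip()]
    return [name.strip() for name in value.split(",") if name.strip()]


names = parse_genome_names("Genome_A, Genome_B,Genome_C")
```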


PARAMETERS: Important stuff Tom never pays attention to (but you should).

  --skip-alignments     By default, anvi'o attempts to align amino acid
sequences in each gene cluster using multiple sequence
alignment via muscle. You can use this flag to skip
that step and be upset later.
--skip-homogeneity    By default, anvi'o attempts to calculate homogeneity
values for every gene cluster, given that they are
aligned. You can use this flag to have anvi'o skip
homogeneity calculations. Anvi'o will ignore this flag
if you decide to skip alignments
--quick-homogeneity   By default, anvi'o will use a homogeneity algorithm
that checks for horizontal and vertical geometric
homogeneity (along with functional). With this flag,
you can tell anvi'o to skip horizontal geometric
homogeneity calculations. It will be less accurate but
quicker. Anvi'o will ignore this flag if you skip
homogeneity calculations or alignments all together.
--align-with ALIGNER  The multiple sequence alignment program to use when
multiple sequence alignment is necessary. To see all
available options, use the flag --list-aligners.
--exclude-partial-gene-calls
By default, anvi'o includes all partial gene calls
in the analysis, which, in some cases, may inflate
the number of gene clusters identified and introduce
extra heterogeneity within those gene clusters. Using
this flag, you can request anvi'o to exclude partial
gene calls from the analysis (whether a gene call is
partial or not is an information that comes directly
from the gene caller used to identify genes during the
generation of the contigs database).
--use-ncbi-blast      This program uses DIAMOND by default, however, if you
like, you can use good ol' blastp from NCBI instead.
--minbit MINBIT       The minimum minbit value. The minbit heuristic
provides a means to eliminate weak matches
between two amino acid sequences. We learned it from
ITEP (Benedict MN et al, doi:10.1186/1471-2164-15-8),
which is a comprehensive analysis workflow for
pangenomes, and decided to use it in the anvi'o
pangenomic workflow, as well. Briefly, if you have two
amino acid sequences, 'A' and 'B', the minbit is
defined as 'BITSCORE(A, B) / MIN(BITSCORE(A, A),
BITSCORE(B, B))'. So the minbit score between two
sequences goes to 1 if they are very similar over the
entire length of the 'shorter' amino acid sequence,
and goes to 0 if (1) they match over a very short
stretch compared even to the length of the shorter
amino acid sequence or (2) the sequence identity of the
match is low. The default is 0.5.
--mcl-inflation INFLATION
MCL inflation parameter, that defines the sensitivity
of the algorithm during the identification of the gene
effect on cluster granularity is here:
(http://micans.org/mcl/man/mclfaq.html#faq7.2). The
default is 2.
--min-occurrence NUM_OCCURRENCE
Do you not want singletons? You don't? Well, this
doubletons, if you want). Anvi'o will remove gene
clusters that occur less than the number you set using
this parameter from the analysis. The default is 1,
which means everything will be kept. If you want to
remove singletons, set it to 2, if you want to remove
doubletons as well, set it to 3, and so on.
--min-percent-identity PERCENT
Minimum percent identity between the two amino acid
sequences for them to have an edge for MCL analysis.
This value will be used to filter hits from Diamond
search results. Because percent identity is not a
predictor of a good match (since it does not
communicate many other important factors such as the
alignment length between the two sequences and its
proportion to the entire length of those involved), we
suggest you rely on 'minbit' parameter. But you know
what? Maybe you shouldn't listen to anyone, and
experiment on your own! The default is 0 percent.
--sensitive           DIAMOND sensitivity. With this flag you can instruct
DIAMOND to be 'sensitive', rather than 'fast' during
the search. It is likely the search will take
remarkably longer. But, hey, if you are doing it for
your final analysis, maybe it should take longer and
be more accurate. This flag is only relevant if you
are running DIAMOND.
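
The minbit definition given under --minbit translates directly into code. The Python sketch below computes it and uses it, together with a percent identity cutoff, to decide which hits would become edges for MCL. The tuple layout of the hits, the function names, and the choice to keep hits that sit exactly on a cutoff are assumptions made for illustration.

```python
def minbit(bitscore_ab, bitscore_aa, bitscore_bb):
    """BITSCORE(A, B) / MIN(BITSCORE(A, A), BITSCORE(B, B))."""
    return bitscore_ab / min(bitscore_aa, bitscore_bb)


def filter_hits(hits, self_scores, minbit_cutoff=0.5, min_percent_identity=0.0):
    """Keep the (query, target) pairs that pass both cutoffs.

    hits: iterable of (query, target, bitscore, percent_identity).
    self_scores: dict mapping a sequence to its self-hit bit score.
    """
    kept = []
    for query, target, bitscore, percent_identity in hits:
        if percent_identity < min_percent_identity:
            continue
        if minbit(bitscore, self_scores[query], self_scores[target]) < minbit_cutoff:
            continue
        kept.append((query, target))
    return kept


# Hypothetical bit scores for three amino acid sequences.
self_scores = {"A": 200.0, "B": 100.0, "C": 300.0}
hits = [("A", "B", 90.0, 85.0), ("A", "C", 60.0, 40.0)]
edges = filter_hits(hits, self_scores)
```

In this example the A-B hit has a minbit of 90 / 100 = 0.9 and is kept, while the A-C hit has a minbit of 60 / 200 = 0.3 and is dropped at the default cutoff of 0.5.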


OTHERS: Sweet parameters of convenience.

  -n PROJECT_NAME, --project-name PROJECT_NAME
Name of the project. Please choose a short but
descriptive name (so anvi'o can use it whenever she
needs to name an output file, or add a new table in a
database, or name her first born).
--description TEXT_FILE
A plain text file that contains some description about
the project. You can use Markdown syntax. The
description text will be rendered and shown in all
relevant interfaces, including the anvi'o interactive
interface, or anvi'o summary outputs.
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files
-W, --overwrite-output-destinations
Overwrite if the output files and/or directories
exist.
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
this option if you are running your commands on a SGE
multiple threads to use, you may deplete your
resources very fast.


ORGANIZING GENE CLUSTERS: These are things that will change the clustering dendrogram of your gene clusters.

  --skip-hierarchical-clustering
Anvi'o attempts to generate a hierarchical clustering
of your gene clusters once it identifies them so you
can use anvi-display-pan to play with it. But if you
want to skip this step, this is your flag.
--enforce-hierarchical-clustering
If you want anvi'o to try to generate a hierarchical
clustering of your gene clusters even if the number of
gene clusters exceeds its suggested limit for
hierarchical clustering, you can use this flag to
enforce it. Are you a rebel of some sort? Or did
machine using this flag.
--distance DISTANCE_METRIC
The distance metric for the clustering of gene
clusters. If you do not use this flag, the default
distance metric will be used for each clustering
configuration which is "euclidean".
The same story with the --distance, except, the
system default for this one is ward.


## anvi-profile

Main entry point for Post-Assembly Metagenomics Pipeline

Usage

anvi-profile [-h] [-i INPUT_BAM] [-c CONTIGS_DB] [--blank-profile]
[-o DIR_PATH] [-W] [-S NAME] [--report-variability-full]
[--skip-SNV-profiling] [--profile-SCVs]
[--description TEXT_FILE] [--cluster-contigs]
[--skip-hierarchical-clustering]
[-M INT] [--max-contig-length INT] [-X INT] [-V INT]
[--list-contigs] [--contigs-of-interest FILE]
[--write-buffer-size INT]


Parameters

INPUTS: There are two possible inputs for the anvi'o profiler. You must declare one of these two.

  -i INPUT_BAM, --input-file INPUT_BAM
Sorted and indexed BAM file to analyze. Takes a long
time depending on the length of the file and
parameters used for profiling.
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
--blank-profile       If you only have contig sequences, but no mapping data
(i.e., you found a genome and would like to take a
look at it), this flag will become very handy. After
creating a contigs database for your contigs, you can
create a blank anvi'o profile database to use anvi'o
interactive interface with that contigs database
without any mapping data.


EXTRAS: Things that are not mandatory, but can be useful if/when declared.

  -o DIR_PATH, --output-dir DIR_PATH
Directory path for output files
-W, --overwrite-output-destinations
Overwrite if the output files and/or directories
exist.
-S NAME, --sample-name NAME
It is important to set a sample name (using only ASCII
letters and digits and without spaces) that is unique
(considering all others). If you do not provide one,
anvi'o will try to make up one for you based on other
information, although you should never let the
software decide these things.
--report-variability-full
One of the things anvi-profile does is to store
Usually it does not report every variable position,
since not every variable position is genuine
variation. Say, if you have 1,000 coverage, and all
nucleotides at that position are Ts and only one of
them is a C, the confidence of that C being a real
variation is quite low. Anvi'o has a simple algorithm
in place to reduce the impact of noise. However, using
this flag you can disable it and ask the profiler to report
every single variation (which may result in very large
output files and millions of reports, but you are the
boss). Do not forget to take a look at the '--min-
coverage-for-variability' parameter.
--skip-SNV-profiling  By default, anvi'o characterizes single-nucleotide
variation in each sample. The use of this flag will
instruct profiler to skip that step. Please remember
that parameters and flags must be identical between
different profiles using the same contigs database for
them to merge properly.
--profile-SCVs        Anvi'o can perform accurate characterization of codon
frequencies in genes during profiling. While having
codon frequencies opens doors to powerful evolutionary
insights in downstream analyses, due to its
computational complexity, this feature comes 'off' by
default. Using this flag you can rise against the
authority as you always should, and make anvi'o
profile codons.
--description TEXT_FILE
A plain text file that contains some description about
the project. You can use Markdown syntax. The
description text will be rendered and shown in all
relevant interfaces, including the anvi'o interactive
interface, or anvi'o summary outputs.


HIERARCHICAL CLUSTERING: Do you want your splits to be clustered? Yes? No? Maybe? Remember: By default, anvi-profile will not perform hierarchical clustering on your splits; but if you use the --blank-profile flag, it will try. You can skip that by using the --skip-hierarchical-clustering flag.

  --cluster-contigs     Single profiles are rarely used for genome binning or
visualization, and since clustering step increases the
profiling runtime for no good reason, the default
behavior is to not cluster contigs for individual
runs. However, if you are planning to do binning on
one sample, you must use this flag to tell anvi'o to
run cluster configurations for single runs on your
sample.
--skip-hierarchical-clustering
If you are not planning to use the interactive
interface (or if you have other means to add a tree of
contigs in the database) you may skip the step where
hierarchical clustering of your items is performed
based on default clustering recipes matching to your
database type.
--distance DISTANCE_METRIC
The distance metric for the hierarchical clustering.
Only relevant if you are using --cluster-contigs
flag. The default is "euclidean".
The linkage method for the hierarchical clustering.
Just like the distance metric this is only relevant if
you are using it with --cluster-contigs flag. The
default is "ward".


NUMBERS: Defaults of these parameters will impact your analysis. You can always come back to them and update your profiles, but it is important to make sure defaults are reasonable for your sample.

  -M INT, --min-contig-length INT
Minimum length of contigs in a BAM file to analyze.
The minimum length should be long enough for tetra-
nucleotide frequency analysis to be meaningful. There
is no way to define a golden number of minimum length
that would be applicable to genomes found in all
environments, but we chose the default to be 2500, and
have been happy with it. You are welcome to
experiment, but we advise to never go below 1,000. You
also should remember that the lower you go, the more
time it will take to analyze all contigs. You can use
--list-contigs parameter to have an idea how many
contigs would be discarded for a given M.
--max-contig-length INT
Just like the minimum contig length parameter, but to
set a maximum. Basically this will remove any contig
longer than a certain value. Why would anyone need
this? Who knows. But if you ever do, it is here.
-X INT, --min-mean-coverage INT
Minimum mean coverage for contigs to be kept in the
analysis. The default value is 0, which is for your
best interest if you are going to profile multiple BAM
files which are then going to be merged for a cross-
sectional or time series analysis. Do not change it if
you are not sure this is what you want to do.
-V INT, --min-coverage-for-variability INT
Minimum coverage of a nucleotide position to be
subjected to SNV profiling. By default, anvi'o will not
attempt to make sense of variation in a given
nucleotide position if it is covered less than 10X.
You can change that minimum using this parameter.
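
The numeric filters in this section can be summarized with a short Python sketch. The defaults of 2500, 0, and 10 come from the descriptions above; the function names are hypothetical, and treating the maximum contig length as "no limit" unless set is an assumption, since its default is not stated here.

```python
def keep_contig(length, mean_coverage, min_contig_length=2500,
                max_contig_length=None, min_mean_coverage=0):
    """Decide whether a contig stays in the analysis (-M, --max-contig-length, -X)."""
    if length < min_contig_length:
        return False
    if max_contig_length is not None and length > max_contig_length:
        return False
    return mean_coverage >= min_mean_coverage


def positions_for_snv_profiling(coverage, min_coverage_for_variability=10):
    """Indices of nucleotide positions covered well enough for SNV profiling (-V)."""
    return [
        position
        for position, depth in enumerate(coverage)
        if depth >= min_coverage_for_variability
    ]
```

For example, `positions_for_snv_profiling([3, 10, 25, 9])` returns `[1, 2]`, because only those two positions reach 10X coverage.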


CONTIGS: Sweet parameters of convenience

  --list-contigs        When declared, the program will list contigs in the
BAM file and exit gracefully without any further
analysis.
--contigs-of-interest FILE
It is possible to analyze only a group of contigs from
a given BAM file. If you provide a text file, in which
every contig of interest is listed line by line, the
profiler would focus only on those contigs in the BAM
file and ignore the rest. This can be used for
debugging purposes, or to focus on a particular group
of contigs that were identified as relevant during the
interactive analysis.


PERFORMANCE: Performance settings for profiler

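  -T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on an SGE
cluster: if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.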
--queue-size INT      The queue size for worker threads to store data to
communicate to the main thread. The default is set by
the class based on the number of threads. If you have
*any* hesitation about whether you know what you are
doing, you should not change this value.
--write-buffer-size INT
How many items should be kept in memory before they
are written to the disk. The default is 500. The
larger the buffer size, the less frequently the program
will access the disk, yet the more memory will be
consumed since the processed items will be cleared off
the memory only after they are written to the disk.
The default buffer size will likely work for most
cases, but if you have very large contigs, you may
need to decrease this value. Please keep an eye on the
memory usage output to make sure the memory use never
exceeds the size of the physical memory.
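
The trade-off behind --write-buffer-size can be sketched as follows. This is a generic buffered-writer pattern, not anvi'o's code; the class name and the `sink` callable are hypothetical.

```python
class BufferedWriter:
    """Keep items in memory and hand them to `sink` once the buffer is full."""

    def __init__(self, sink, buffer_size=500):
        self.sink = sink              # any callable that persists a list of items
        self.buffer_size = buffer_size
        self.buffer = []
        self.flush_count = 0

    def add(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        # Items leave memory only after they have been handed to the sink.
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer.clear()
            self.flush_count += 1


written = []
writer = BufferedWriter(written.extend, buffer_size=3)
for item in range(7):
    writer.add(item)
writer.flush()  # write whatever is left at the end
print(writer.flush_count)  # prints 3
```

A larger buffer means fewer calls to the sink but more items held in memory at once, which is why very large contigs may call for a smaller value.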


## anvi-push

Push stuff to an anvi'server

Usage

anvi-push [-h] --user USERNAME [--api-url API_URL] -n PROJECT_NAME
[-t NEWICK] [--items-order FLAT_FILE] [-f FASTA]
[-d VIEW_DATA] [-A ADDITIONAL_LAYERS] [-s STATE]
[--description TEXT_FILE] [--bins BINS_DATA]
[--bins-info BINS_INFO] [--delete-if-exists]


Parameters

  --user USERNAME       The user for an anvi'server.
--api-url API_URL     Anvi'server url


PROJECT DETAILS: What to send to the server

  -n PROJECT_NAME, --project-name PROJECT_NAME
Name of the project. Please choose a short but
descriptive name (so anvi'o can use it whenever she
needs to name an output file, or add a new table in a
database, or name her first born).
-t NEWICK, --tree NEWICK
NEWICK formatted tree structure
--items-order FLAT_FILE
A flat file that contains the order of items you wish
to display using the interactive interface. You may
want to use this if you have a specific order of items
in your mind, and do not want to display a tree in the
middle (or simply you don't have one). The file format
is simple: each line should have an item name, and
-f FASTA, --fasta-file FASTA
A FASTA-formatted input file
-d VIEW_DATA, --view-data VIEW_DATA
A TAB-delimited file for view data
-A ADDITIONAL_LAYERS, --additional-layers ADDITIONAL_LAYERS
A TAB-delimited file for additional layers for splits.
The first column of this file must be split names, and
the remaining columns should be unique attributes. The
file does not need to contain all split names, or
values for each split in every column. Anvi'o will try
to deal with missing data nicely. Each column in this
file will be visualized as a new layer in the tree.
-s STATE, --state STATE
State file, you can export states from database using
anvi-export-state program
--description TEXT_FILE
A plain text file that contains some description about
the project. You can use Markdown syntax. The
description text will be rendered and shown in all
relevant interfaces, including the anvi'o interactive
interface, or anvi'o summary outputs.
--bins BINS_DATA      Tab-delimited file, first column contains tree leaves
(gene clusters, splits, contigs etc.) and second
column contains which Bin they belong.
--bins-info BINS_INFO
Additional information for bins. The file must contain
three TAB-delimited columns, where the first one must
be a unique bin name, the second should be a 'source',
and the last one should be a 7-character HTML color
code (e.g., '#424242'). Source column must contain
information about the origin of the bin. If these bins
are automatically identified by a program like
CONCOCT, this column could contain the program name
and version. The source information will be associated
with the bin in various interfaces so in a sense it is
not *that* critical what it says there, but on the
other hand it is, because we should also think about
people who may end up having to work with what we put
together later.
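
To make the --bins-info format concrete, here is a minimal sketch that builds one line of such a file and checks the color code. The helper name and the example values are hypothetical; only the three-column layout and the 7-character color code come from the description above.

```python
import re

# '#' followed by six hexadecimal digits, e.g. '#424242'
HTML_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")


def bins_info_line(bin_name, source, color):
    """Return one TAB-delimited line: bin name, source, HTML color code."""
    if not HTML_COLOR.fullmatch(color):
        raise ValueError(f"'{color}' is not a 7-character HTML color code")
    return "\t".join([bin_name, source, color])


print(bins_info_line("Bin_1", "CONCOCT", "#424242"))
```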


RISKY CLICKS: As the name suggests!

  --delete-if-exists    Be bold (at your own risk), and delete if exists.


## anvi-refine

Start the anvi'o interactive interface for refining

Usage

anvi-refine [-h] -p PROFILE_DB -c CONTIGS_DB [-C COLLECTION_NAME]
[-b BIN_NAME] [-B FILE_PATH] [-V ADDITIONAL_VIEW]
[--taxonomic-level {t_phylum,t_class,t_order,t_family,t_genus,t_species}]
[--hide-outlier-SNVs] [--title NAME]
[--export-svg FILE_PATH] [--dry-run]
[--server-only]


Parameters

DEFAULT INPUTS: The interactive interface can be started with and without anvi'o databases. The default use assumes you have your profile and contigs database, however, it is also possible to start the interface using ad-hoc input files. See the 'MANUAL INPUT' section for another set of parameters that are mutually exclusive with databases.

  -p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'


REFINE-SPECIFICS: Parameters that are essential to the refinement process.

  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.
-B FILE_PATH, --bin-ids-file FILE_PATH
Text file for bins (each line should be a unique bin
id).


  -V ADDITIONAL_VIEW, --additional-view ADDITIONAL_VIEW
A TAB-delimited file for an additional view to be used
in the interface. This file should contain all
split names, and values for each of them in all
samples. Each column in this file must correspond to a
sample name. Content of this file will be called
'user_view', which will be available as a new item in
the 'views' combo box in the interface.
-A ADDITIONAL_LAYERS, --additional-layers ADDITIONAL_LAYERS
A TAB-delimited file for additional layers for splits.
The first column of this file must be split names, and
the remaining columns should be unique attributes. The
file does not need to contain all split names, or
values for each split in every column. Anvi'o will try
to deal with missing data nicely. Each column in this
file will be visualized as a new layer in the tree.


  --split-hmm-layers    When declared, this flag tells the interface to split
every gene found in HMM searches that were performed
against non-singlecopy gene HMM profiles into their
own layer. Please see the documentation for details.
--taxonomic-level {t_phylum,t_class,t_order,t_family,t_genus,t_species}
The taxonomic level to use. The default is 't_genus'.
Only relevant if the anvi'o contigs database contains
taxonomic annotations.
--hide-outlier-SNVs   During profiling, anvi'o marks positions of single-
nucleotide variations (SNVs) that originate from
places in contigs where coverage values are a bit
'sketchy'. If you would like to avoid SNVs in those
positions of splits in applicable projects you can use
this flag, and the interface would hide SNVs that are
marked as 'outlier' (although it is clearly the best
to see everything, no one will judge you if you end up
using this flag) (plus, there may or may not be some
historical data on this here:
https://github.com/meren/anvio/issues/309).
--title NAME          Title for the interface. If you are working with a
RUNINFO dict, the title will be determined based on
information stored in that file. Regardless, you can
override that value using this parameter. If you are
not using an anvi'o RUNINFO dictionary, a meaningful
title will appear in the interface only if you define
one using this parameter.
--export-svg FILE_PATH
The SVG output file path.


SWEET PARAMS OF CONVENIENCE: Parameters and flags that are not quite essential (but nice to have).

  --dry-run             Don't do anything real. Test everything, and stop
right before wherever the developer said 'well, this
is enough testing', and decided to print out results.
--skip-init-functions
When declared, function calls for genes will not be
initialized (therefore will be missing from all
relevant interfaces or output files). The use of this
flag may reduce the memory fingerprint and processing
time for large datasets.
--skip-auto-ordering  When declared, the attempt to include automatically
generated orders of items based on additional data is
skipped. In case those buggers cause issues with your
data, and you still want to see your stuff and deal
with the other issue maybe later.


SERVER CONFIGURATION: For power users.

  -I IP_ADDR, --ip-address IP_ADDR
IP address for the HTTP server. The default IP address
(0.0.0.0) should work just fine for most.
-P INT, --port-number INT
Port number to use for anvi'o services. If nothing is
declared, anvi'o will try to find a suitable port
number, starting from the default port number, 8080.
--browser-path PATH   By default, anvi'o will use your default browser to
launch the interactive interface. If you would like to
use something else than your system default, you can
provide a full path for an alternative browser using
this parameter, and hope for the best. For instance we
are using this parameter to call Google's experimental
browser, Canary, which performs better with demanding
visualizations.
--read-only           When the interactive interface is started with this
flag, all 'database write' operations will be
disabled.
--server-only         The default behavior is to start the local server, and
fire up a browser that connects to the server. If you
have other plans, and want to start the server without
calling the browser, this is the flag you need.


## anvi-rename-bins

Rename all bins in a given collection (so they have pretty names).

Usage

anvi-rename-bins [-h] -c CONTIGS_DB -p PROFILE_DB
[--collection-to-write COLLECTION_TO_WRITE]
[--prefix PREFIX] [--report-file REPORT_FILE_PATH]
[--list-collections] [--dry-run] [--call-MAGs]
[--min-completion-for-MAG [0-100]]
[--max-redundancy-for-MAG [0-100]]
[--size-for-MAG MEGABASEPAIRS]


Parameters

DEFAULT INPUTS: Standard stuff

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
--collection-to-read COLLECTION_NAME
Collection name to read from. Anvi'o will not
overwrite an existing collection, instead, it will
create a copy of your collection with new bin names.
--collection-to-write COLLECTION_TO_WRITE
The new collection name. Give it a nice, fancy name.


OUTPUT AND TESTING: a.k.a, sweet parameters of convenience

  --prefix PREFIX       Prefix for the bin names. Must be a single word,
composed of letters and digits. The use of the
underscore character is OK, but that's about it (fine,
the use of the dash character is OK, too but no
more!). If the prefix is 'PREFIX', each bin will be
renamed as 'PREFIX_XXX_00001', 'PREFIX_XXX_00002', and
so on, in the order of percent completion minus
percent redundancy (what we call, 'substantive
completion'). The 'XXX' part will either be 'Bin', or
'MAG', depending on other parameters you use. Keep
--report-file REPORT_FILE_PATH
This file will report each name change event, so you
can trace back the original names of renamed bins
later.
--list-collections    Show available collections and exit.
--dry-run             When used does NOT update the profile database, just
creates the report file so you can view how things
will be renamed.


MAG OPTIONS: If you want to call some bins 'MAGs' because you are so cool

  --call-MAGs           This program by default renames your bins as
'PREFIX_Bin_00001', 'PREFIX_Bin_00002' and so on. If
you use this flag, it will name the ones that meet the
criteria described by MAG-related flags as
'PREFIX_MAG_00001', 'PREFIX_MAG_00002', and so on. The
ones that do not get to be named as MAGs will remain
as bins.
--min-completion-for-MAG [0-100]
If --call-MAGs flag is used, call any bin a 'MAG' if
their completion estimate is above this (the default
is 70), and the redundancy estimate is less than
--max-redundancy-for-MAG.
--max-redundancy-for-MAG [0-100]
If --call-MAGs flag is used, call any bin a 'MAG' if
their redundancy estimate is below this (the default
is 10) and the completion estimate is above --min-
completion-for-MAG.
--size-for-MAG MEGABASEPAIRS
If --call-MAGs flag is used, call any bin a 'MAG' if
their redundancy estimate is less than --max-
redundancy-for-MAG, AND THEIR SIZE IS LARGER THAN THIS
VALUE REGARDLESS OF THE COMPLETION ESTIMATE. The
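
The naming scheme described above can be sketched roughly as follows. This is an illustration under stated assumptions, not anvi'o's implementation: the function name is hypothetical, bins are assumed to be ordered from highest to lowest substantive completion, MAGs and bins are assumed to share a single counter, the thresholds are treated as strict inequalities, and the --size-for-MAG rule is left out.

```python
def rename_bins(bins, prefix, call_mags=False, min_completion=70, max_redundancy=10):
    """Map old bin names to new ones.

    `bins` maps a bin name to a (percent completion, percent redundancy) tuple.
    """
    # Order by 'substantive completion': completion minus redundancy.
    ordered = sorted(bins.items(), key=lambda kv: kv[1][0] - kv[1][1], reverse=True)

    new_names = {}
    for number, (old_name, (completion, redundancy)) in enumerate(ordered, start=1):
        is_mag = call_mags and completion > min_completion and redundancy < max_redundancy
        label = "MAG" if is_mag else "Bin"
        new_names[old_name] = f"{prefix}_{label}_{number:05d}"
    return new_names


bins = {"a": (95, 2), "b": (60, 1), "c": (80, 15)}
for old, new in rename_bins(bins, "PREFIX", call_mags=True).items():
    print(old, "->", new)
```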


## anvi-report-linkmers

Access reads in contigs and positions in a BAM file

Usage

anvi-report-linkmers [-h] -i INPUT_BAM(S) [INPUT_BAM(S) ...]
--contigs-and-positions CONTIGS_AND_POS
[--list-contigs]


Parameters

optional arguments:

  -i INPUT_BAM(S) [INPUT_BAM(S) ...], --input-files INPUT_BAM(S) [INPUT_BAM(S) ...]
Sorted and indexed BAM files to analyze. It is
essential that all BAM files must be the result of
mappings against the same contigs.
--contigs-and-positions CONTIGS_AND_POS
This is the file where you list the contigs, and
nucleotide positions you are interested in. This is
supposed to be a TAB-delimited file with two columns.
In each line, the first column should be the contig
name, and the second column should be the comma-
separated list of integers for nucleotide positions.
When declared, only reads that cover all positions
will be reported. It is necessary to use this flag if
you want to perform oligotyping-like analyses on
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
--list-contigs        When declared, the program will list contigs in the
BAM file and exit gracefully without any further
analysis.
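
The two-column file expected by --contigs-and-positions could be produced with a sketch like the one below. The helper name and the contig names are hypothetical, and sorting and de-duplicating the positions is a convenience added here rather than a stated requirement.

```python
def contigs_and_positions_content(targets):
    """Return TAB-delimited content: contig name, comma-separated positions.

    `targets` maps a contig name to an iterable of integer positions.
    """
    lines = []
    for contig_name, positions in targets.items():
        position_list = ",".join(str(p) for p in sorted(set(positions)))
        lines.append(f"{contig_name}\t{position_list}")
    return "\n".join(lines) + "\n"


print(contigs_and_positions_content({"contig_1": [120, 45, 120], "contig_7": [9]}), end="")
```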


## anvi-run-hmms

This program deals with populating tables that store HMM hits in an anvi'o contigs database.

Usage

anvi-run-hmms [-h] -c CONTIGS_DB [-H HMM PROFILE PATH]
[-I HMM PROFILE NAME] [-T NUM_THREADS]


Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-H HMM PROFILE PATH, --hmm-profile-dir HMM PROFILE PATH
You can use this parameter to specify a directory
path that contains an HMM profile. This way you can run
HMM profiles that are not included in anvi'o. See the
online documentation to find out about the specifics of this
directory structure.
-I HMM PROFILE NAME, --installed-hmm-profile HMM PROFILE NAME
When you run this program without any parameter, it
runs all 4 HMM profiles installed on your system. If
you want only a specific one to run, you can select it
by using this parameter. These are the currently
available ones: "Rinke_et_al" (type: singlecopy),
"Campbell_et_al" (type: singlecopy),
"BUSCO_83_Protista" (type: singlecopy),
"Ribosomal_RNAs" (type: Ribosomal_RNAs).
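-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on an SGE
cluster: if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.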


## anvi-run-ncbi-cogs

Run NCBI COGs on stuff.

Usage

anvi-run-ncbi-cogs [-h] -c CONTIGS_DB [--cog-data-dir COG_DATA_DIR]
[--temporary-dir-path PATH]
[--search-with SEARCH_METHOD]


Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
--cog-data-dir COG_DATA_DIR
The directory path for your COG setup. Anvi'o will try
to use the default path if you do not specify
anything.
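-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on an SGE
cluster: if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.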
--sensitive           DIAMOND sensitivity. With this flag you can instruct
DIAMOND to be 'sensitive', rather than 'fast' during
the search. It is likely the search will take
remarkably longer. But, hey, if you are doing it for
your final analysis, maybe it should take longer and
be more accurate. This flag is only relevant if you
are running DIAMOND.
--temporary-dir-path PATH
If you don't provide anything here, this program will
come up with a temporary directory path by itself to
store intermediate files, and clean it later. If you
want to have full control over this, you can use this
flag to define one.
--search-with SEARCH_METHOD
What program to use for database searching. The
default search uses diamond. All available options
include: diamond, blastp.


## anvi-run-pfams

Run Pfam on Contigs Database.

Usage

anvi-run-pfams [-h] -c CONTIGS_DB [--pfam-data-dir PFAM_DATA_DIR]


Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
--pfam-data-dir PFAM_DATA_DIR
The directory path for your Pfam setup. Anvi'o will
try to use the default path if you do not specify
anything.
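-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on an SGE
cluster: if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.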


## anvi-run-workflow

optional arguments: -h, --help show this help message and exit

Usage

anvi-run-workflow [-h] [-w WORKFLOW]
[--get-default-config OUTPUT_FILENAME]
[--list-workflows] [--list-dependencies]
[-c CONFIG_FILE] [--dry-run] [--save-workflow-graph]
[-A ...]


Parameters

ESSENTIAL INPUTS: Things you must provide or this won't work

  -w WORKFLOW, --workflow WORKFLOW
You must specify a workflow name. To see a list of
available workflows run --list-workflows.


  --get-default-config OUTPUT_FILENAME
Store a JSON-formatted config file with all the
default settings of the workflow. This is a good draft
you could use in order to write your own config file.
This config file contains all parameters that could be
configured for this workflow. NOTICE: the config file
is provided with default values only for parameters
that are set by us in the workflow. The values for the
rest of the parameters are determined by the relevant
program.
--list-workflows      Print a list of available snakemake workflows
--list-dependencies   Print a list of the dependencies of this workflow. You
must provide a workflow name and a config file.
snakemake will figure out which rules need to be run
according to your config file, and according to the
files available on your disk. According to the rules
that need to be run, we will let you know which
programs are going to be used, so that you can make
sure you have all of them installed and loaded.
-c CONFIG_FILE, --config-file CONFIG_FILE
TBD
--dry-run             Don't do anything real. Test everything, and stop
right before wherever the developer said 'well, this
is enough testing', and decided to print out results.
--save-workflow-graph
Save a graph representation of the workflow. If you
are using this flag and if your system is unable to
generate such graph outputs, you will hear anvi'o
complaining (still, totally worth trying).
-A ..., --additional-params ...
Additional snakemake parameters to add when running
snakemake. NOTICE: --additional-params HAS TO BE THE
LAST ARGUMENT THAT IS PASSED TO anvi-run-workflow,
ANYTHING THAT FOLLOWS WILL BE CONSIDERED AS PART OF
THE ADDITIONAL PARAMETERS THAT ARE PASSED TO
SNAKEMAKE. Any parameter that is accepted by snakemake
should be fair game here, but it is your
responsibility to make sure that whatever you added
makes sense. To see what parameters are available
please refer to the snakemake documentation. For
example, you could use this to set up cluster
CLUSTER-SUBMISSION-CMD"
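
Why --additional-params has to come last can be illustrated with Python's argparse, where an argument declared with `nargs=argparse.REMAINDER` swallows everything that follows it. This is a sketch under assumptions, not a copy of anvi'o's parser: the parser below is a stripped-down stand-in, and the workflow name and the snakemake options in the example are placeholders.

```python
import argparse

parser = argparse.ArgumentParser(prog="workflow-sketch")
parser.add_argument("-w", "--workflow")
# REMAINDER collects every remaining argument, including ones that look like flags.
parser.add_argument("-A", "--additional-params", nargs=argparse.REMAINDER)

args = parser.parse_args(
    ["-w", "my_workflow", "-A", "--cluster", "qsub", "--jobs", "4"]
)

print(args.workflow)           # prints my_workflow
print(args.additional_params)  # prints ['--cluster', 'qsub', '--jobs', '4']
```

Anything placed after -A, even an option the wrapper itself understands, ends up in the list handed over to snakemake.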


## anvi-saavs-and-protein-structures-summary

Generate a static web site for SAAVs and protein structures.

Usage

anvi-saavs-and-protein-structures-summary [-h] [-c CONTIGS_DB]
[--genes GENES]
[--samples SAMPLES] -i
DIR_PATH -o DIR_PATH
[--perspectives PERSPECTIVES]


Parameters

CONTIGS DB: If you provide a contigs database, anvi'o will find out about functions and other properties of genes using the contigs database. This is supposed to be the contigs database you used to generate variability profile for this project like 2 years ago. Yeah. Time goes by :/

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'


WHAT SHOULD BE PROCESSED: By default, anvi'o will learn about the genes and samples you have from the input data directory. If you want to overwrite that information (i.e. to work with a smaller set of genes or samples), you can come up with your own files.

  --genes GENES         Genes file.
--samples SAMPLES     Samples file.


INPUT/OUTPUT: Read from here, write to there.

  -i DIR_PATH, --input-dir DIR_PATH
Directory path for input files
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files


  --soft-link-images    By default, your images will be copied in the output
directory to create a fully self-contained output (so
you can send it to your colleagues and they would have
everything they need to browse the output).
Alternatively you can use this flag to avoid copying
images in the output directory (it would make the
output less portable, but it would take less time and
space to generate it).
--perspectives PERSPECTIVES
By default anvi'o will use each perspective found in
the data directory to create an HTML output. Using
this parameter you can limit perspectives to the ones
you are interested in by defining them as a comma-
separated list. If you make a mistake anvi'o will tell
you what are the available perspectives, so don't
worry.


## anvi-search-functions

Search functions in an anvi'o contigs database or genomes storage. Basically, this program searches for one or more search terms you define in functional annotations of genes in an anvi'o contigs database, and generates multiple reports. The simpler report (which is also the default one) simply tells you which contigs contain genes with functions matching the search terms you used. This file is only useful to quickly highlight matching contigs in the interface by providing it to anvi-interactive with the --additional-layers parameter. You can also request a much more comprehensive report, which gives you anything you might need to know, including the matching gene caller id, functional annotation source, and full function name for each hit and search term.

Usage

anvi-search-functions [-h] [-c CONTIGS_DB] [-p PAN_DB]
[-g GENOMES_STORAGE] --search-terms SEARCH_TERMS
[--delimiter CHAR]
[--annotation-sources SOURCE NAME[S]] [-l]
[-o FILE_PATH] [--full-report FILE_NAME]
[--include-sequences] [--verbose]


Parameters

SEARCH IN: Relevant source databases

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-p PAN_DB, --pan-db PAN_DB
Anvi'o pan database
-g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
Anvi'o genomes storage file


SEARCH FOR: Relevant terms

  --search-terms SEARCH_TERMS
Search terms. Multiple of them can be declared
separated by a delimiter (the default is a comma).
--delimiter CHAR      The delimiter to parse multiple input terms. The
default is ','.
--annotation-sources SOURCE NAME[S]
Get functional annotations for a specific list of
annotation sources. You can specify one or more
sources by separating them from each other with a
comma character (e.g., '--annotation-sources
source_1,source_2,source_3'). The default behavior is
to return everything.
-l, --list-annotation-sources
List available functional annotation sources.


REPORT: Anvi'o can report the hits in multiple ways. The output file will be a very simple 2-column TAB-delimited output that is compatible with anvi'o additional data format (so you can give it to anvi-interactive to see which splits contained genes that matched your search terms). You can also ask anvi'o to generate a full report that contains much more, and much more helpful, information about each hit. Optionally you can even ask for the gene sequences to appear in this report.

  -o FILE_PATH, --output-file FILE_PATH
File path to store results.
--full-report FILE_NAME
Optional output file with a fuller description of
findings.
--include-sequences   Include sequences in the report.
--verbose             Be verbose, print more messages whenever possible.
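
How --search-terms and --delimiter fit together can be sketched as below. This is only an illustration: the function names are hypothetical, and trimming whitespace and matching case-insensitive substrings are assumptions about reasonable behavior, not a description of how anvi'o matches terms.

```python
def parse_search_terms(raw_terms, delimiter=","):
    """Split the raw --search-terms string on the delimiter, dropping empty terms."""
    return [term.strip() for term in raw_terms.split(delimiter) if term.strip()]


def find_hits(annotations, terms):
    """Return (gene caller id, matching term) pairs.

    `annotations` maps a gene caller id to its function text.
    """
    hits = []
    for gene_id, function in annotations.items():
        for term in terms:
            if term.lower() in function.lower():
                hits.append((gene_id, term))
    return hits


annotations = {
    1: "Histidine kinase",
    2: "ABC transporter permease",
    3: "hypothetical protein",
}
terms = parse_search_terms("kinase, transporter")
print(find_hits(annotations, terms))  # prints [(1, 'kinase'), (2, 'transporter')]
```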


## anvi-self-test

A script for anvi'o to test itself

Usage

anvi-self-test [-h] [--suite SUITE]


Parameters

optional arguments:

  --suite SUITE  Suite of tests to execute. By default this program will
execute a full suite of example anvi'o commands to ensure
developers could think of. Alternatively you can choose a
specific test to run. Here is a full list of available
options: mini, full, pangenomics, alons-classifier.


## anvi-setup-ncbi-cogs

Usage

anvi-setup-ncbi-cogs [-h] [--cog-data-dir COG_DATA_DIR] [--reset]


Parameters

optional arguments:

  --cog-data-dir COG_DATA_DIR
The directory for COG data to be stored. If you leave
it as is without specifying anything, the default
destination for the data directory will be used to set
things up. The advantage of it is that everyone will
be using a single data directory, but then you may
need superuser privileges to do it. Using this
parameter you can choose the location of the data
directory somewhere you like. However, when it is time
to run COGs, you will need to remember that path and
provide it to the program.
--reset               This program by default attempts to use previously
downloaded files if there are any. If something is wrong for some reason you can
use this to tell anvi'o to remove everything, and
start over.
--just-do-it          Don't bother me with questions or warnings, just do
it.
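-T NUM_THREADS, --num-threads NUM_THREADS
Maximum number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
/ cores on your system. Plus, please be careful with
this option if you are running your commands on an SGE
cluster: if you are clusterizing your runs, and asking for
multiple threads to use, you may deplete your
resources very fast.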


## anvi-setup-pfams

Usage

anvi-setup-pfams [-h] [--pfam-data-dir PFAM_DATA_DIR] [--reset]


Parameters

optional arguments:

  --pfam-data-dir PFAM_DATA_DIR
The directory for Pfam data to be stored. If you leave
it as is without specifying anything, the default
destination for the data directory will be used to set
things up. The advantage of it is that everyone will
be using a single data directory, but then you may
need superuser privileges to do it. Using this
parameter you can choose the location of the data
directory somewhere you like. However, when it is time
to run Pfam, you will need to remember that path and
provide it to the program.
--reset               This program by default attempts to use previously
downloaded files if there are any. If something is wrong for some reason you can
use this to tell anvi'o to remove everything, and
start over.


## anvi-show-collections-and-bins

A script to display collections stored in an anvi'o profile or pan database.

Usage

anvi-show-collections-and-bins [-h] -p PAN_OR_PROFILE_DB


Parameters

optional arguments:

  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database


## anvi-show-misc-data

Show all misc data keys in all misc data tables

Usage

anvi-show-misc-data [-h] -p PAN_OR_PROFILE_DB [-t NAME] [-D NAME]


Parameters

optional arguments:

  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database
-t NAME, --target-data-table NAME
The target table is the table you are interested in
accessing. Currently it can be 'items', 'layers', or
'layer_orders'. Please see most up-to-date online
-D NAME, --target-data-group NAME
Data group to focus. Anvi'o misc data tables support
associating a set of data keys with a data group. If
you have no idea what this is, then probably you don't
need it, and anvi'o will take care of you. Note: this
flag is IRRELEVANT if you are working with additional
order data tables.


## anvi-split

Split an anvi'o profile into smaller profiles. This is usually great when you want to share a subset of an anvi'o profile. You give this guy an anvi'o profile database, a contigs database, and a collection id, and it gives you back directories of profiles for each bin that can be treated as individual anvi'o profiles.

Usage

anvi-split [-h] -p PROFILE_DB -c CONTIGS_DB [-C COLLECTION_NAME]
[-b BIN_NAME] [-o DIR_PATH] [--list-collections]
[--skip-hierarchical-clustering] [--skip-variability-tables]
[--compress-auxiliary-data]
[--enforce-hierarchical-clustering]


Parameters

DATABASES: Declaring relevant anvi'o databases. First things first.

  -p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'


COLLECTION: You should provide a valid collection name. If you do not provide bin names, the program will generate an output for each bin in your collection separately.

  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.


OUTPUT: Where do we want the resulting split profiles to be stored.

  -o DIR_PATH, --output-dir DIR_PATH
Directory path for output files


EXTRAS: Stuff that you rarely need, but you really really need when the time comes. Following parameters will apply to each of the resulting anvi'o profiles that will be split from the mother anvi'o profile.

  --list-collections    Show available collections and exit.
--skip-hierarchical-clustering
If you are not planning to use the interactive
interface (or if you have other means to add a tree of
contigs in the database) you may skip the step where
hierarchical clustering of your items is performed
based on default clustering recipes matching to your
database type.
--skip-variability-tables
Processing variability tables in profile database might
take a very very long time. With this flag you will be
--compress-auxiliary-data
When declared, the auxiliary data file in the
resulting output will be compressed. This saves space,
but it takes a long time. Also, if you are planning to
compress the entire output later using GZIP, it is
pointless to do. But you are the boss!
--enforce-hierarchical-clustering
If you have more than 25,000 splits in your merged
profile, anvi-merge will automatically skip the
hierarchical clustering of splits (by setting --skip-
hierarchical-clustering flag on). This is due to the
fact that computational time required for hierarchical
clustering grows much faster than linearly with the number of
items being clustered. Based on our experience we
decided that 25,000 splits is about the maximum we
should try. However, this is not a theoretical limit,
and you can overwrite this heuristic by using this
flag, which would tell anvi'o to attempt to cluster
splits regardless.
--distance DISTANCE_METRIC
The distance metric for the hierarchical clustering.
If you do not use this flag, the default distance
metric will be used for each clustering configuration
which is "euclidean".
--linkage LINKAGE_METHOD
The same story as with --distance, except the
system default for this one is ward.


## anvi-summarize

Summarize an anvi'o collection. Fun stuff.

Usage

anvi-summarize [-h] -p PAN_OR_PROFILE_DB [-c CONTIGS_DB]
[-g GENOMES_STORAGE] [--init-gene-coverages]
[-C COLLECTION_NAME] [-o DIR_PATH] [--list-collections]
[--taxonomic-level {t_phylum,t_class,t_order,t_family,t_genus,t_species}]
[--cog-data-dir COG_DATA_DIR] [--quick-summary]
[--just-do-it] [--report-aa-seqs-for-gene-calls]


Parameters

PROFILE: The profile. It could be a standard or pan profile database.

  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database


PROFILE TYPE SPECIFIC PARAMETERS: If you are summarizing a collection stored in a standard anvi'o profile, you will need a contigs database to go with it. If you are working with a pan profile, then you will need to provide a genomes storage. Don't worry too much, because anvi'o will warn you gently if you make a mistake.

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
Anvi'o genomes storage file


STANDARD PROFILE SPECIFIC PARAMS: Parameters that are only relevant to standard profile summaries (declaring or not declaring them will not change anything if you are summarizing a pan profile).

  --init-gene-coverages
Initialize gene coverage and detection data. This is a
very computationally expensive step, but it is
necessary when you need gene level coverage data. The
reason this is very computationally expensive is
because anvi'o computes gene coverages by going back
to actual coverage values of each gene to average
them, instead of using contig average coverage values,
for extreme accuracy.
--report-aa-seqs-for-gene-calls
You can use this flag if you would like to find
translated DNA sequences for your gene calls in the
genes output file.


COMMONS: Common parameters for both pan and standard profile summaries.

  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files
--list-collections    Show available collections and exit.
--taxonomic-level {t_phylum,t_class,t_order,t_family,t_genus,t_species}
The taxonomic level to use. The default is 't_genus'.
Only relevant if the anvi'o contigs database contains
taxonomic annotations.
--cog-data-dir COG_DATA_DIR
The directory path for your COG setup. Anvi'o will try
to use the default path if you do not specify
anything.
--quick-summary       When declared, the summary output will be generated as
quickly as possible, with a minimum amount of essential
information.


EXTRA: Extra stuff because you're extra.

  --just-do-it          Don't bother me with questions or warnings, just do
it.


## anvi-update-db-description

Update the description in an anvi'o database

Usage

anvi-update-db-description [-h] --description TEXT_FILE DB


Parameters

positional arguments:

  DB                    An anvi'o database.


optional arguments:

  --description TEXT_FILE
A plain text file that contains some description about
the project. You can use Markdown syntax. The
description text will be rendered and shown in all
relevant interfaces, including the anvi'o interactive
interface, or anvi'o summary outputs.


## anvi-update-genes-in-structure-database

Add or remove genes from an already existing structure database. All settings used to generate your database will be used in this program.

Usage

anvi-update-genes-in-structure-database [-h] -c CONTIGS_DB -s
STRUCTURE_DB
[--genes-to-remove GENES_TO_REMOVE]
[--genes-to-remove-file GENES_TO_REMOVE_FILE]
[--dump-dir DUMP_DIR]
[--modeller-executable MODELLER_EXECUTABLE]


Parameters

DATABASES: Declaring relevant anvi'o databases. First things first.

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-s STRUCTURE_DB, --structure-db STRUCTURE_DB
Anvi'o structure database.


GENES: Specifying which genes you want to be modelled.

  --genes-to-remove GENES_TO_REMOVE, -r GENES_TO_REMOVE
Gene caller ids to remove from your structure
database. Multiple of them can be declared by
separating with comma (e.g. --genes-to-remove
2,4,5,6).
--genes-to-remove-file GENES_TO_REMOVE_FILE, -R GENES_TO_REMOVE_FILE
A file of gene caller ids to remove from your
structure database. Each line in the file should be a
gene caller id.
Gene caller ids to add to your structure
database. Multiple of them can be declared by
separating with comma (e.g. --genes-to-add 2,4,5,6).
A file of gene caller ids to add to your
structure database. Each line in the file should be a
gene caller id.


OUTPUT: Output file and output style.

  --dump-dir DUMP_DIR   Modelling and annotating structures requires a lot of
moving parts, each of which has its own outputs. The
output of this program is a structure database
containing the pertinent results of this computation,
however a lot of stuff doesn't make the cut. By
providing a directory for this parameter you will get,
in addition to the structure database, a directory
containing the raw output for everything.


MODELLER EXECUTABLE: Which executable program to use for MODELLER, e.g. mod9.19

  --modeller-executable MODELLER_EXECUTABLE
The MODELLER program to use. For example, mod9.19.
The default is mod9.19


MISCELLANEOUS: Other stuff

  --skip-genes-if-already-present
If a gene is already present in the structure database,
instead of complaining it will be skipped.


## anvi-script-add-default-collection

A script to add a 'DEFAULT' collection in an anvi'o pan or profile database with a bin named 'EVERYTHING' that describes all items available in the profile database.

Usage

anvi-script-add-default-collection [-h] -p PAN_OR_PROFILE_DB
[-c CONTIGS_DB]


Parameters

optional arguments:

  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
Anvi'o pan or profile database
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'


## anvi-script-calculate-pn-ps-ratio

Extract information for variable positions

Usage

anvi-script-calculate-pn-ps-ratio [-h] [-a SAAV_TABLE] [-b SCV_TABLE]
-c CONTIGS_DB [-j FLOAT]
[-i MINIMUM_NUM_VARIANTS]
[-m MIN_COVERAGE] -o DIR_PATH


Parameters

VARIABILITY: Two variability tables generated from anvi-gen-variability-table are required. One of SAAVs (generated with --engine AA) and one of SCVs (generated with --engine CDN). They must be generated with the same profile database and the exact same set of genes in the contigs database. To be safe, it is highly recommended you use the same settings during both commands except for changing --engine AA to --engine CDN and the output filename.

  -a SAAV_TABLE, --saav-table SAAV_TABLE
Filepath to the SAAV table.
-b SCV_TABLE, --scv-table SCV_TABLE
Filepath to the SCV table.
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Filepath to the contigs database used to generate
variability tables.


OUTPUT: The output of this program is a directory with several tables.

  -o DIR_PATH, --output-dir DIR_PATH
Directory path for output files


TUNABLES: Successfully tune one or more of these parameters to unlock the "Advanced anvi'an" achievement.

  -j FLOAT, --min-departure-from-consensus FLOAT
Variants (either SCVs or SAAVs) will be ignored if
they have a departure from consensus less than this
value. Note: Keep in mind you may have already
supplied this parameter during anvi-gen-variability-
profile. Default is 0.1.
-i MINIMUM_NUM_VARIANTS, --minimum-num-variants MINIMUM_NUM_VARIANTS
Ignore genes with fewer than this number of single
codon variants. This avoids being impressed by infinite
pN/pS values, when really all that happened was that
a gene had 1 SAAV and 0 synonymous SCVs. The default
is 4 to ensure some level of statistical importance.
-m MIN_COVERAGE, --min-coverage MIN_COVERAGE
If the coverage value at a codon is less than this
amount, any SAAVs or SCVs associated with it will be
ignored.
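
As a rough illustration of how the tunables above interact, the following Python sketch filters hypothetical variant records. The record layout (`gene`, `departure`, `coverage`), the function name, and the `min_coverage` values are assumptions made for illustration. It is not anvi'o's implementation, and it does not compute pN/pS itself.

```python
from collections import defaultdict

def genes_passing_filters(scvs, min_departure=0.1, min_num_variants=4, min_coverage=0):
    """Return the set of genes that keep at least `min_num_variants` SCVs.

    `scvs` is an iterable of dicts with hypothetical keys 'gene',
    'departure' (departure from consensus) and 'coverage'.
    The default of 0 for `min_coverage` is an assumption of this sketch.
    """
    kept = defaultdict(int)
    for v in scvs:
        # Ignore variants below the departure-from-consensus cutoff.
        if v["departure"] < min_departure:
            continue
        # Ignore variants at codons with insufficient coverage.
        if v["coverage"] < min_coverage:
            continue
        kept[v["gene"]] += 1
    # Ignore genes with fewer than the minimum number of single codon variants.
    return {gene for gene, n in kept.items() if n >= min_num_variants}

scvs = (
    [{"gene": 1, "departure": 0.3, "coverage": 50}] * 4
    + [{"gene": 2, "departure": 0.3, "coverage": 50}] * 3
    + [{"gene": 2, "departure": 0.05, "coverage": 50}]
)
print(genes_passing_filters(scvs, min_coverage=10))
```

In this toy input, gene 2 has only three variants left once the low-departure one is ignored, so it falls below the default minimum of 4 and is dropped.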


## anvi-script-checkm-tree-to-interactive

Generate files for the anvi'o interactive interface from a tree file generated by CheckM.

Usage

anvi-script-checkm-tree-to-interactive [-h] -t CHECKM TREE -o DIRECTORY


Parameters

optional arguments:

  -t CHECKM TREE, --tree CHECKM TREE
Tree file generated by CheckM.
-o DIRECTORY, --output-dir DIRECTORY
The directory in which output files will be stored.


## anvi-script-compute-ani-for-fasta

Run ANI between contigs in a single FASTA file.

Usage

anvi-script-compute-ani-for-fasta [-h] -f FASTA -o DIR_PATH [-p PAN_DB]
[--log-file FILE_PATH]
[--method {ANIm,ANIb,ANIblastall,TETRA}]
[--distance DISTANCE_METRIC]
[--just-do-it]


Parameters

optional arguments:

  -f FASTA, --fasta-file FASTA
A FASTA-formatted input file
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files
-p PAN_DB, --pan-db PAN_DB
Anvi'o pan database
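Number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
on your system. Please be careful with
this option if you are running your commands on an SGE
cluster: if you are asking for
multiple threads to use, you may deplete your
resources very fast.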
--log-file FILE_PATH  File path to store debug/output messages.
--method {ANIm,ANIb,ANIblastall,TETRA}
Method for pyANI. The default is ANIb. You must have
the necessary binary in path for whichever method you
choose. According to the pyANI help for v0.2.7 at
https://github.com/widdowquinn/pyani, the method
'ANIm' uses MUMmer (NUCmer) to align the input
sequences. 'ANIb' uses BLASTN+ to align 1020nt
fragments of the input sequences. 'ANIblastall' uses
the legacy BLASTN to align 1020nt fragments. Finally,
'TETRA' calculates tetranucleotide frequencies of
each input sequence.
--distance DISTANCE_METRIC
The distance metric for the hierarchical clustering.
The default is "euclidean".
The linkage method for the hierarchical clustering.
The default is "ward".
--just-do-it          Don't bother me with questions or warnings, just do
it.


## anvi-script-filter-fasta-by-blast

Filter FASTA file according to BLAST table (remove sequences with bad BLAST alignment).

Usage

anvi-script-filter-fasta-by-blast [-h] [-f FASTA] [-o FILE_PATH] -b
BLAST_OUTPUT -s OUTFMT -t THRESHOLD


Parameters

optional arguments:

  -f FASTA, --fasta-file FASTA
A FASTA-formatted input file
-o FILE_PATH, --output-file FILE_PATH
File path to store results.
-b BLAST_OUTPUT, --blast-output BLAST_OUTPUT
BLAST table generated with blastp. --outfmt 6 as the
output format is assumed.
-s OUTFMT, --outfmt OUTFMT
Specify the column ordering of your BLAST report. We
add the following parameter to our BLAST searches so
the output report contains the qlen field, which is
not included by default: -outfmt '6 qseqid sseqid
pident length mismatch gapopen qstart qend sstart send
evalue bitscore qlen slen'. You may have used a
different -outfmt parameter, and you should use this
parameter to explicitly define the column names in
your report. If you used the -outfmt
parameter mentioned above, then the correct version of
this parameter would be: "qseqid sseqid pident length
mismatch gapopen qstart qend sstart send evalue
bitscore qlen slen". Regardless of the BLAST output
format, your columns MUST contain the following
parameters for this program to work properly:
'qseqid', 'bitscore', 'length', 'qlen', and 'pident'.
-t THRESHOLD, --threshold THRESHOLD
What proper_pident threshold do you want to use for
filtering out sequences whose top bit-score matches
have proper_pidents less than this threshold? We
have defined proper_pident to be the percentage of
the query amino acids that both aligned to and were
identical to the corresponding matched amino acid.
Note that the pident parameter output by BLAST does
not include regions of the query sequence unaligned to
the matched sequence, whereas proper_pident does.
For example, a sequence that's only half aligned by a
match but with 100% identity at matched regions has a
pident of 100 but a proper_pident of 50. The
default is 30.0%.
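
The proper_pident definition above can be sketched in a few lines of Python. This is a minimal illustration, not the program's own code: the function names are hypothetical, and the formula assumes that the BLAST `length` column approximates the number of aligned query positions (it ignores gaps).

```python
def proper_pident(pident, length, qlen):
    """Percent of the query that is both aligned and identical.

    Assumes `length` (alignment length) approximates the number of aligned
    query positions, i.e. it ignores gaps; this is a simplification.
    """
    return pident * length / qlen

def queries_to_keep(hits, threshold=30.0):
    """Keep query ids whose top bit-score hit reaches `threshold`.

    `hits` is an iterable of dicts keyed by the required column names:
    'qseqid', 'bitscore', 'length', 'qlen', and 'pident'.
    """
    best = {}
    for hit in hits:
        current = best.get(hit["qseqid"])
        # Remember only the hit with the highest bit score per query.
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return {
        q for q, h in best.items()
        if proper_pident(h["pident"], h["length"], h["qlen"]) >= threshold
    }

hits = [
    {"qseqid": "a", "bitscore": 200.0, "length": 50, "qlen": 100, "pident": 100.0},
    {"qseqid": "b", "bitscore": 90.0, "length": 20, "qlen": 100, "pident": 100.0},
]
print(proper_pident(100.0, 50, 100))  # 50.0
print(queries_to_keep(hits))          # {'a'}
```

Query "a" mirrors the half-aligned example in the help text: a pident of 100 over half of the query gives a proper_pident of 50. How queries without any BLAST hit are treated is not covered by this sketch.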


## anvi-script-gen-CPR-classifier

Train a classifier for CPR prediction

Usage

anvi-script-gen-CPR-classifier [-h] [-o OUTPUT] matrix


Parameters

positional arguments:

  matrix                TAB-delimited matrix of CPR genome names, classes, and
presence absence of single-copy genes. Headers of the
first two rows should be "genome", and "class". The
rest of the rows should be single-copy genes.


optional arguments:

  -o OUTPUT, --output OUTPUT
Output file name for the classifier.


## anvi-script-gen-distribution-of-genes-in-a-bin

Quantify the detection of genes in genomes in metagenomes to identify the environmental core. This is a helper script for anvi'o metapangenomic workflow.

Usage

anvi-script-gen-distribution-of-genes-in-a-bin [-h] -c CONTIGS_DB
[-p PROFILE_DB]
[-C COLLECTION_NAME]
[-b BIN_NAME]
[--min-detection FLOAT]
[--fraction-of-median-coverage FLOAT]


Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
-b BIN_NAME, --bin-id BIN_NAME
Bin name you are interested in.
--min-detection FLOAT
For this entire thing to work, the genome you are
focusing on should be detected in at least one
metagenome. If that is not the case, it would mean
that you do not have any sample that represents the
niche for this organism (or you do not have enough
depth of coverage) to investigate the detection of
genes in the environment. By default, this script
requires at least '0.5' of the genome to be detected
in at least one metagenome. This parameter allows you
to change that. 0 would mean no detection test
required, 1 would mean the entire genome must be
detected.
--fraction-of-median-coverage FLOAT
The value set here will be used to remove a gene if
its total coverage across environments is less than
the median coverage of all genes multiplied by this
value. The default is 0.25, which means, if the median
total coverage of all genes across all samples is
100X, then, a gene with a total coverage of less than
25X across all samples will be assumed not a part of
the 'environmental core'.
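
The --fraction-of-median-coverage rule described above can be illustrated with a short Python sketch. The function name and the input layout (a mapping from gene id to total coverage across samples) are assumptions for illustration; the script's actual data handling may differ.

```python
from statistics import median

def environmental_core(total_coverage, fraction_of_median=0.25):
    """Split genes by total coverage relative to the median.

    `total_coverage` maps a gene id to its coverage summed across samples.
    Returns the set of genes at or above the cutoff, and the cutoff itself.
    """
    cutoff = median(total_coverage.values()) * fraction_of_median
    core = {g for g, cov in total_coverage.items() if cov >= cutoff}
    return core, cutoff

coverage = {"g1": 100.0, "g2": 120.0, "g3": 80.0, "g4": 20.0, "g5": 110.0}
core, cutoff = environmental_core(coverage)
print(cutoff)        # 25.0
print(sorted(core))  # ['g1', 'g2', 'g3', 'g5']
```

With a median total coverage of 100X and the default fraction of 0.25, the cutoff is 25X, so the gene with 20X is left out of the environmental core.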


## anvi-script-gen-hmm-hits-matrix-across-genomes

A simple script to generate a TAB-delimited file for the presence or absence of HMM hits in a given set of contigs databases and an HMM source.

Usage

anvi-script-gen-hmm-hits-matrix-across-genomes [-h] [--list-sources]
[--source SOURCE] -o
FILE_PATH
CONTIG DATABASE(S)
[CONTIG DATABASE(S) ...]


Parameters

positional arguments:

  CONTIG DATABASE(S)    One or more anvi'o contigs databases.


optional arguments:

  --list-sources        Show available single-copy gene sources and exit.
--source SOURCE       A single HMM source to focus on. This HMM source
should be found in all contigs databases.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.


## anvi-script-gen-programs-network

Generate a network of anvi'o programs

Usage

anvi-script-gen-programs-network [-h] [-o FILE_PATH]
[-p PROGRAM_NAMES_TO_FOCUS]


Parameters

optional arguments:

  -o FILE_PATH, --output-file FILE_PATH
File path to store results.
-p PROGRAM_NAMES_TO_FOCUS, --program-names-to-focus PROGRAM_NAMES_TO_FOCUS
Comma-separated list of program names to focus on. Mostly
for debugging purposes.


## anvi-script-gen-programs-vignette

Generate a vignette for anvi'o programs

Usage

anvi-script-gen-programs-vignette [-h] [-o FILE_PATH]
[-p PROGRAM_NAMES_TO_FOCUS]


Parameters

optional arguments:

  -o FILE_PATH, --output-file FILE_PATH
File path to store results.
-p PROGRAM_NAMES_TO_FOCUS, --program-names-to-focus PROGRAM_NAMES_TO_FOCUS
Comma-separated list of program names to focus on. Mostly
for debugging purposes.


## anvi-script-gen-short-reads

Usage

anvi-script-gen-short-reads [-h] [--output-file-path FILE_PATH]
CONFIG_FILE


Parameters

positional arguments:

  CONFIG_FILE           Configuration file


optional arguments:

  --output-file-path FILE_PATH
Output FASTA file path


## anvi-script-gen_stats_for_single_copy_genes.py

A simple script to generate info from search tables

Usage

anvi-script-gen_stats_for_single_copy_genes.py [-h] [--list-sources]
[--source SOURCE]
CONTIGS_DB


Parameters

positional arguments:

  CONTIGS_DB       Contigs database to read from.


optional arguments:

  --list-sources   Show available single-copy gene search results and exit.
--source SOURCE  Source to focus on. If none declared, all single-copy gene
sources are going to be listed.


## anvi-script-genbank-to-external-gene-calls

This script takes a genbank file and converts it into the format required for importing external gene calls and functions into anvi'o.

Usage

anvi-script-genbank-to-external-gene-calls [-h] -i INPUT_GB
[-s ANNOTATION_SOURCE]
[-v ANNOTATION_VERSION]
[-o OUTPUT_GENE_CALLS_TSV]
[-a OUTPUT_FUNCTIONS_TSV]
[-f OUTPUT_FASTA]


Parameters

optional arguments:

  -i INPUT_GB, --input_gb INPUT_GB
Input GenBank file (e.g. typically "*.gbk", "*.gb",
"*.gbff")
-s ANNOTATION_SOURCE, --annotation_source ANNOTATION_SOURCE
Annotation source (default: "NCBI_PGAP")
-v ANNOTATION_VERSION, --annotation_version ANNOTATION_VERSION
Annotation source version (default: "v4.6")
-o OUTPUT_GENE_CALLS_TSV, --output_gene_calls_tsv OUTPUT_GENE_CALLS_TSV
Output tsv file (default: "external_gene_calls.tsv")
-a OUTPUT_FUNCTIONS_TSV, --output_functions_tsv OUTPUT_FUNCTIONS_TSV
Output functions file (default: "functions.tsv")
-f OUTPUT_FASTA, --output_fasta OUTPUT_FASTA
Output fasta file with matching, simplified headers to
be ready for anvi-gen-contigs-db (default:
"clean.fa")


## anvi-script-get-collection-info

Provides information about each bin in a given collection.

Usage

anvi-script-get-collection-info [-h] -c CONTIGS_DB [-p PROFILE_DB]
[-C COLLECTION_NAME]
[--list-collections] [-o FILE_PATH]


Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
--list-collections    Show available collections and exit.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.


## anvi-script-merge-collections

Generate an additional data file from multiple collections.

Usage

anvi-script-merge-collections [-h] -c CONTIGS_DB -i FILE(S) [FILE(S) ...]
-o OUTPUT_FILE


Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-i FILE(S) [FILE(S) ...], --input-files FILE(S) [FILE(S) ...]
Input file(s). TAB-delimited input files should have
two columns, where the first column holds the contig
name, and the second one the bin id. This is the
standard output of the program anvi-export-collection.
-o OUTPUT_FILE, --output-file OUTPUT_FILE
Output file name.
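
To make the two-column input format concrete, here is a minimal Python sketch that merges several (contig name, bin id) tables into one table with a column per input. The function name, the column labels, and the output layout are assumptions for illustration. The real program also takes a contigs database, which this sketch ignores.

```python
import csv
import io

def merge_collections(named_files):
    """Merge two-column (contig, bin) TAB-delimited inputs into one table.

    `named_files` maps a column label to an open text file. Returns
    (header, rows) with one row per contig and one column per input.
    """
    labels = list(named_files)
    bins = {}
    for label, handle in named_files.items():
        for contig, bin_id in csv.reader(handle, delimiter="\t"):
            bins.setdefault(contig, {})[label] = bin_id
    header = ["contig"] + labels
    rows = [
        # Contigs missing from an input get an empty cell in that column.
        [contig] + [bins[contig].get(label, "") for label in labels]
        for contig in sorted(bins)
    ]
    return header, rows

inputs = {
    "concoct": io.StringIO("c1\tBin_1\nc2\tBin_2\n"),
    "manual": io.StringIO("c1\tBin_A\n"),
}
header, rows = merge_collections(inputs)
print("\t".join(header))
for row in rows:
    print("\t".join(row))
```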


## anvi-script-predict-CPR-genomes

Screen for genomes to find likely members of CPR

Usage

anvi-script-predict-CPR-genomes [-h] -c CONTIGS_DB [-p PROFILE_DB]
[-C COLLECTION_NAME]
[--list-collections]
[--report-only-cpr]
[--min-genome-size MIN_GENOME_SIZE]
[--min-percent-completion MIN_PERCENT_COMPLETION]
[--max-percent-redundancy MAX_PERCENT_REDUNDANCY]
[--min-class-probability MIN_CLASS_PROBABILITY]
[-o FILE_PATH]
classifier_object


Parameters

positional arguments:

  classifier_object     Model output generated by anvi-script-gen-CPR-
classifier


optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
-C COLLECTION_NAME, --collection-name COLLECTION_NAME
Collection name.
--list-collections    Show available collections and exit.
--report-only-cpr     Include only bins that look like CPR genomes.
--min-genome-size MIN_GENOME_SIZE
Minimum genome size to consider for CPR in Mbp.
Default is 0.5.
--min-percent-completion MIN_PERCENT_COMPLETION
Minimum percent completion estimate based on anvi'o
default single-copy gene collections. Default is 50
--max-percent-redundancy MAX_PERCENT_REDUNDANCY
Maximum percent redundancy of single-copy genes in an
anvi'o bin, or a genome to consider for
classification. The default is 30
--min-class-probability MIN_CLASS_PROBABILITY
If the classification confidence is below this don't
bother. Default is 75.
-o FILE_PATH, --output-file FILE_PATH
File path to store results.


## anvi-script-reformat-fasta

Reformat FASTA file (remove contigs based on length, or based on a given list of deflines, and/or generate an output with simpler names)

Usage

anvi-script-reformat-fasta [-h] [-l MIN_LENGTH] [-i TXT FILE]
[-I TXT FILE] -o FASTA FILE
[--simplify-names] [--prefix PREFIX]
[-r REPORT FILE]
contigs_fasta


Parameters

positional arguments:

  contigs_fasta


optional arguments:

  -l MIN_LENGTH, --min-len MIN_LENGTH
Minimum length of contigs to keep (contigs shorter
than this value will not be included in the output
file). The default is 0, so nothing will be removed if
you do not declare a minimum size.
-i TXT FILE, --exclude-ids TXT FILE
IDs to remove from the FASTA file. You cannot provide
both --keep-ids and --exclude-ids.
-I TXT FILE, --keep-ids TXT FILE
If provided, all IDs not in this file will be excluded
from the reformatted FASTA file. Any additional
filters (such as --min-len) will still be applied to
the IDs in this file. You cannot provide both
--exclude-ids and --keep-ids.
-o FASTA FILE, --output-file FASTA FILE
Output file path.
--simplify-names      Edit deflines to make sure the contigs have simple
names.
--prefix PREFIX       Use this parameter if you would like to add a prefix
to your contig names while simplifying them. The
prefix must be a single word (you can use the underscore
character, but nothing more!).
-r REPORT FILE, --report-file REPORT FILE
Report file path. When you run this program with
--simplify-names flag, all changes to deflines will
be reported in this file in case you need to go back
to this information later. It is not mandatory to
declare one, but it is a very good idea to have it.
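
The interplay of --min-len, --simplify-names, --prefix, and the report file can be sketched as follows. This is an illustrative Python sketch only: the function names and the simplified naming scheme (a zero-padded counter) are hypothetical and are not taken from the program itself.

```python
def parse_fasta(lines):
    """Yield (defline, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def reformat(lines, min_len=0, simplify_names=False, prefix=None):
    """Return (records, report) after length filtering and optional renaming."""
    records, report = [], []
    for defline, seq in parse_fasta(lines):
        if len(seq) < min_len:
            continue  # contigs shorter than min_len are dropped
        new_name = defline
        if simplify_names:
            # Hypothetical naming scheme: running counter with optional prefix.
            counter = "c_%06d" % (len(records) + 1)
            new_name = "%s_%s" % (prefix, counter) if prefix else counter
            # The report keeps the mapping from new name to original defline.
            report.append((new_name, defline))
        records.append((new_name, seq))
    return records, report

fasta = [">contig one len=8", "ACGTACGT", ">short", "ACG"]
records, report = reformat(fasta, min_len=5, simplify_names=True, prefix="S1")
print(records)  # [('S1_c_000001', 'ACGTACGT')]
print(report)   # [('S1_c_000001', 'contig one len=8')]
```

Keeping the report is what makes the renaming reversible, which is why the help text recommends declaring a report file.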


## anvi-script-run-eggnog-mapper

Run eggnog-mapper on a contigs database, and store results

Usage

anvi-script-run-eggnog-mapper [-h] -c CONTIGS_DB
[--cog-data-dir COG_DATA_DIR]
[--drop-previous-annotations]
[--annotation EMAPPER_ANNOTATION_FILE]
[--use-version EMAPPER_VERSION]


Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
--cog-data-dir COG_DATA_DIR
The directory path for your COG setup if you did not
use the default directory.
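Number of threads to use for multithreading
whenever possible. Very conservatively, the default is
1. It is a good idea to not exceed the number of CPUs
on your system. Please be careful with
this option if you are running your commands on an SGE
cluster: if you are asking for
multiple threads to use, you may deplete your
resources very fast.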
--drop-previous-annotations
When declared, previous annotations in the database
will be dropped.
--annotation EMAPPER_ANNOTATION_FILE
If you have an annotation file from a previous run,
you can call this program to import the contents of
that file into the database instead of a run from
scratch. In that case, you must also use the --use-
version parameter to clarify which parser version
should be used to parse it.
--use-version EMAPPER_VERSION
The version of eggnog-mapper that generated the
annotation file.


## anvi-script-snvs-to-interactive

Take the output of anvi-gen-variability-profile and prepare an output for the interactive interface

Usage

anvi-script-snvs-to-interactive [-h]
[--min-departure-from-consensus FLOAT]
[--max-departure-from-consensus FLOAT]
-o DIR_PATH
profile


Parameters

positional arguments:

  profile               The output file generated by anvi-gen-variability-
profile


optional arguments:

  --min-departure-from-consensus FLOAT
Minimum departure from consensus at a given variable
nucleotide position. The default is 0.00.
--max-departure-from-consensus FLOAT
Maximum departure from consensus at a given variable
nucleotide position. The default is 0.99.
-o DIR_PATH, --output-dir DIR_PATH
Directory path for output files


## anvi-script-transpose-matrix

Transpose a TAB-delimited file

Usage

anvi-script-transpose-matrix [-h] -o FILE_PATH input_file


Parameters

positional arguments:

  input_file            Input matrix.


optional arguments:

  -o FILE_PATH, --output-file FILE_PATH
File path to store results.
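
For readers who want to see what transposing a TAB-delimited file amounts to, here is a minimal Python sketch using only the standard library. The function name is hypothetical, and the check for a rectangular matrix is an assumption of this sketch rather than documented behavior of the script.

```python
import csv
import io

def transpose_tab_delimited(in_handle, out_handle):
    """Write the transpose of a rectangular TAB-delimited matrix."""
    rows = list(csv.reader(in_handle, delimiter="\t"))
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all rows must have the same number of columns")
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    # zip(*rows) turns the i-th column of the input into the i-th output row.
    writer.writerows(zip(*rows))

source = io.StringIO("sample\tg1\tg2\ns1\t1\t0\ns2\t0\t1\n")
target = io.StringIO()
transpose_tab_delimited(source, target)
print(target.getvalue(), end="")
```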


