Microbial 'omics


The latest version of anvi’o is v7.1. See the release notes.

Here you will find the current anvi’o programs in the latest stable version of the platform, and their help menus. The contents of this file were last updated on 19 Oct 21 20:59:06, and then anvi’o looked like this:

Key                              Value
Anvi'o version                   hope (v7-dev)
Profile DB version               38
Contigs DB version               20
Genes DB version                 6
Auxiliary data storage version   2
Pan DB version                   15
Genome data storage version      7
Structure DB version             2
KEGG Modules DB version          2
tRNA-seq DB version              2

Summary

Main anvi’o programs (116) anvi-analyze-synteny, anvi-compute-completeness, anvi-compute-functional-enrichment-across-genomes, anvi-compute-functional-enrichment-in-pan, anvi-compute-gene-cluster-homogeneity, anvi-compute-genome-similarity, anvi-compute-metabolic-enrichment, anvi-db-info, anvi-delete-collection, anvi-delete-functions, anvi-delete-hmms, anvi-delete-misc-data, anvi-delete-state, anvi-dereplicate-genomes, anvi-display-contigs-stats, anvi-display-functions, anvi-display-metabolism, anvi-display-pan, anvi-display-structure, anvi-estimate-genome-completeness, anvi-estimate-metabolism, anvi-estimate-scg-taxonomy, anvi-estimate-trna-taxonomy, anvi-experimental-organization, anvi-export-collection, anvi-export-contigs, anvi-export-functions, anvi-export-gene-calls, anvi-export-gene-coverage-and-detection, anvi-export-items-order, anvi-export-locus, anvi-export-misc-data, anvi-export-splits-and-coverages, anvi-export-splits-taxonomy, anvi-export-state, anvi-export-structures, anvi-export-table, anvi-gen-contigs-database, anvi-gen-fixation-index-matrix, anvi-gen-gene-consensus-sequences, anvi-gen-gene-level-stats-databases, anvi-gen-genomes-storage, anvi-gen-network, anvi-gen-phylogenomic-tree, anvi-gen-structure-database, anvi-gen-variability-matrix, anvi-gen-variability-network, anvi-gen-variability-profile, anvi-get-aa-counts, anvi-get-codon-frequencies, anvi-get-pn-ps-ratio, anvi-get-sequences-for-gene-calls, anvi-get-sequences-for-gene-clusters, anvi-get-sequences-for-hmm-hits, anvi-get-short-reads-from-bam, anvi-get-short-reads-mapping-to-a-gene, anvi-get-split-coverages, anvi-get-tlen-dist-from-bam, anvi-help, anvi-import-collection, anvi-import-functions, anvi-import-items-order, anvi-import-misc-data, anvi-import-state, anvi-import-taxonomy-for-genes, anvi-import-taxonomy-for-layers, anvi-init-bam, anvi-inspect, anvi-interactive, anvi-matrix-to-newick, anvi-mcg-classifier, anvi-merge, anvi-merge-bins, anvi-merge-trnaseq, anvi-meta-pan-genome, 
anvi-migrate, anvi-oligotype-linkmers, anvi-pan-genome, anvi-plot-trnaseq, anvi-profile, anvi-profile-blitz, anvi-push, anvi-refine, anvi-rename-bins, anvi-report-inversions, anvi-report-linkmers, anvi-run-hmms, anvi-run-interacdome, anvi-run-kegg-kofams, anvi-run-ncbi-cogs, anvi-run-pfams, anvi-run-scg-taxonomy, anvi-run-trna-taxonomy, anvi-run-workflow, anvi-scan-trnas, anvi-search-functions, anvi-search-palindromes, anvi-search-sequence-motifs, anvi-self-test, anvi-setup-interacdome, anvi-setup-kegg-kofams, anvi-setup-ncbi-cogs, anvi-setup-pdb-database, anvi-setup-pfams, anvi-setup-scg-taxonomy, anvi-setup-trna-taxonomy, anvi-show-collections-and-bins, anvi-show-misc-data, anvi-split, anvi-summarize, anvi-summarize-blitz, anvi-tabulate-trnaseq, anvi-trnaseq, anvi-update-db-description, anvi-update-structure-database, anvi-upgrade.

Ad hoc anvi’o scripts (35) anvi-script-add-default-collection, anvi-script-augustus-output-to-external-gene-calls, anvi-script-checkm-tree-to-interactive, anvi-script-compute-ani-for-fasta, anvi-script-estimate-genome-size, anvi-script-filter-fasta-by-blast, anvi-script-filter-hmm-hits-table, anvi-script-fix-homopolymer-indels, anvi-script-gen-CPR-classifier, anvi-script-gen-distribution-of-genes-in-a-bin, anvi-script-gen-functions-per-group-stats-output, anvi-script-gen-genomes-file, anvi-script-gen-help-pages, anvi-script-gen-hmm-hits-matrix-across-genomes, anvi-script-gen-programs-network, anvi-script-gen-programs-vignette, anvi-script-gen-pseudo-paired-reads-from-fastq, anvi-script-gen-scg-domain-classifier, anvi-script-gen-short-reads, anvi-script-gen_stats_for_single_copy_genes.sh, anvi-script-get-coverage-from-bam, anvi-script-get-hmm-hits-per-gene-call, anvi-script-get-primer-matches, anvi-script-merge-collections, anvi-script-permute-trnaseq-seeds, anvi-script-pfam-accessions-to-hmms-directory, anvi-script-predict-CPR-genomes, anvi-script-process-genbank, anvi-script-process-genbank-metadata, anvi-script-reformat-fasta, anvi-script-run-eggnog-mapper, anvi-script-snvs-to-interactive, anvi-script-tabulate, anvi-script-transpose-matrix, anvi-script-variability-to-vcf.


Programs

Please let us know if there is something unclear in this output.

anvi-analyze-synteny

Extract ngrams, as in 'co-occurring genes in synteny', from genomes

Usage

usage: anvi-analyze-synteny [-h] -g GENOMES_STORAGE
                            [--ngram-window-range NGRAM_WINDOW_RANGE]
                            [-o FILE_PATH] [--annotation-source SOURCE NAME]
                            [-p PAN_DB] [-n NGRAM_SOURCE] [-l]
                            [--analyze-unknown-functions] [-G GENOME_NAMES]
                            [--first-functional-hit-only]

Parameters

Essential INPUT:

  -g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
                        Anvi'o genomes storage file (default: None)
  --ngram-window-range NGRAM_WINDOW_RANGE
                        The range of window sizes of Ngrams to analyze for
                        synteny patterns. Please format the window-range as x:y
                        (e.g. Window sizes 2 to 4 would be denoted as: 2:4)
                        (default: 2:3)
  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
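
As a rough illustration of how an x:y window range could translate into Ngrams, the Python sketch below slides windows of each size over an ordered list of gene annotations. The function names (`parse_window_range`, `extract_ngrams`) and the example annotations are hypothetical, and treating the upper bound as inclusive is an assumption based on the "2 to 4" wording above, not on anvi'o's source code.

```python
def parse_window_range(text):
    """Parse a window range given as 'x:y' into a (low, high) tuple of ints."""
    low, high = (int(part) for part in text.split(":"))
    if low < 1 or high < low:
        raise ValueError(f"invalid window range: {text!r}")
    return low, high


def extract_ngrams(genes, window_range="2:3"):
    """Return every run of consecutive genes for each window size in the range.

    Assumption: the upper bound of the range is inclusive.
    """
    low, high = parse_window_range(window_range)
    ngrams = []
    for size in range(low, high + 1):
        # Slide a window of `size` genes along the ordered gene list.
        for start in range(len(genes) - size + 1):
            ngrams.append(tuple(genes[start:start + size]))
    return ngrams


# Hypothetical annotations for four neighboring genes on one contig.
contig = ["COG1", "COG2", "COG3", "COG4"]
for ngram in extract_ngrams(contig, "2:3"):
    print(ngram)
```

With four genes and the default range of 2:3, this sketch yields three windows of size 2 and two windows of size 3.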

Annotation sources for Ngrams: Choose one source of annotations for your Ngrams.

  --annotation-source SOURCE NAME
                        Get functional annotations for a specific annotation
                        source. You can use the flag '--list-annotation-
                        sources' to learn about what sources are available.
                        (default: None)
  -p PAN_DB, --pan-db PAN_DB
                        Anvi'o pan database (default: None)
  -n NGRAM_SOURCE, --ngram-source NGRAM_SOURCE
                        If two annotation sources are provided, please choose
                        one annotation source that will be used to calculate
                        Ngrams (e.g. gene_clusters, COG_FUNCTION) (default:
                        None)

Optional arguments:

  -l, --list-annotation-sources
                        List available functional annotation sources.
                        (default: False)
  --analyze-unknown-functions
                        Provide this flag if you want anvi-analyze-synteny to
                        report Ngrams that contain gene calls that have no
                        annotation. (default: False)
  -G GENOME_NAMES, --genome-names GENOME_NAMES
                        Genome names to 'focus'. You can use this parameter to
                        limit the genomes included in your analysis. You can
                        provide these names as a comma-separated list of
                        names, or you can put them in a file, where you have a
                        single genome name in each line, and provide the file
                        path. (default: None)
  --first-functional-hit-only
                        Use this flag if you want to use only the first
                        functional annotation when making ngrams and assigning
                        annotations. In some cases, anvi'o reports more than
                        one annotation when there are multiple good hits to
                        the gene. When this happens, all annotations will be
                        reported in order of alignment score and delimited by
                        '!!!' e.g. 'COG123!!!COG456!!!COG789'. This flag will
                        report 'COG123!!!COG456!!!COG789' as 'COG123'.
                        (default: False)
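
The effect described for --first-functional-hit-only can be pictured with a one-line string operation. This is only a sketch of the idea, not anvi'o's implementation, and `first_functional_hit` is a hypothetical helper name.

```python
def first_functional_hit(annotation, delimiter="!!!"):
    """Keep only the first annotation from a '!!!'-delimited string.

    The help text above states that annotations are listed in order of
    alignment score, so the first entry is the one that is kept.
    """
    return annotation.split(delimiter)[0]


print(first_functional_hit("COG123!!!COG456!!!COG789"))
```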

anvi-compute-completeness

A script to generate completeness info for a given list of splits

Usage

usage: anvi-compute-completeness [-h] [--splits-of-interest FILE] -c
                                 CONTIGS_DB [-e E-VALUE]
                                 [--list-completeness-sources]
                                 [--completeness-source NAME]

Parameters

optional arguments:

  --splits-of-interest FILE
                        A file with split names. There should be only one
                        column in the file, and each line should correspond to
                        a unique split name. (default: None)
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -e E-VALUE, --min-e-value E-VALUE
                        Minimum significance score of an HMM find to be
                        considered as a valid hit. Default is 1e-15.
  --list-completeness-sources
                        Show available sources and exit. (default: False)
  --completeness-source NAME
                        Single-copy gene source to use to estimate
                        completeness. (default: None)

anvi-compute-functional-enrichment-across-genomes

A program that computes functional enrichment across groups of genomes.

functions

Usage

usage: anvi-compute-functional-enrichment-across-genomes [-h] [-e FILE_PATH]
                                                         [-i FILE_PATH]
                                                         [-g GENOMES_STORAGE]
                                                         -G TEXT_FILE
                                                         [--annotation-source SOURCE NAME]
                                                         [-l] -o FILE_PATH
                                                         [--include-ungrouped]
                                                         [-F FILE]
                                                         [--just-do-it]

Parameters

SOURCES OF GENOMES: To estimate enriched functions across your genomes, you can provide any combination of the following: an external genomes file, an internal genomes file, and/or a genomes storage file. Anvi'o will aggregate all functions in genomes found in any of these sources, and will do its magic.

  -e FILE_PATH, --external-genomes FILE_PATH
                        A two-column TAB-delimited flat text file that lists
                        anvi'o contigs databases. The first item in the header
                        line should read 'name', and the second should read
                        'contigs_db_path'. Each line in the file should
                        describe a single entry, where the first column is the
                        name of the genome (or MAG), and the second column is
                        the anvi'o contigs database generated for this genome.
                        (default: None)
  -i FILE_PATH, --internal-genomes FILE_PATH
                        A five-column TAB-delimited flat text file. The header
                        line must contain these columns: 'name', 'bin_id',
                        'collection_id', 'profile_db_path', 'contigs_db_path'.
                        Each line should list a single entry, where 'name' can
                        be any name to describe the anvi'o bin identified as
                        'bin_id' that is stored in a collection. (default:
                        None)
  -g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
                        Anvi'o genomes storage file (default: None)
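
The two genome list formats described above are plain TAB-delimited tables, so they can be produced with a few lines of Python. The sketch below writes an external genomes file to an in-memory buffer; the genome names and database paths are made up for illustration, and `write_genomes_file` is a hypothetical helper. The header names are the ones given in the parameter descriptions.

```python
import csv
import io

# Header names as listed in the help text above.
EXTERNAL_GENOMES_HEADER = ["name", "contigs_db_path"]
INTERNAL_GENOMES_HEADER = ["name", "bin_id", "collection_id",
                           "profile_db_path", "contigs_db_path"]


def write_genomes_file(handle, header, rows):
    """Write a TAB-delimited genomes file, one genome (or bin) per line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        if len(row) != len(header):
            raise ValueError(f"expected {len(header)} columns, got {len(row)}")
        writer.writerow(row)


# Hypothetical genomes and paths.
buffer = io.StringIO()
write_genomes_file(buffer, EXTERNAL_GENOMES_HEADER,
                   [("genome_A", "genome_A.db"), ("genome_B", "genome_B.db")])
print(buffer.getvalue(), end="")
```

To write to disk instead, the same function can be given a file opened with `open(path, "w", newline="")`.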

RELATIONSHIPS BETWEEN GENOMES: Here you need to provide a groups file so anvi'o can understand which genome is in which group.

  -G TEXT_FILE, --groups-txt TEXT_FILE
                        A tab-delimited text file specifying which group each
                        item belongs to. Depending on the context, items here
                        may be individual samples or genomes. The first column
                        must contain item names matching to those that are in
                        your input data. A different column should have the
                        header 'group' and contain the group name for each
                        item. Each item should be associated with a single
                        group. It is always a good idea to define groups using
                        single words without any fancy characters. For
                        instance, `HIGH_TEMPERATURE` or `LOW_FITNESS` are good
                        group names. `my group #1` or `IS-THIS-OK?`, are not
                        good group names. (default: None)
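
A groups file can be generated and sanity-checked before it is handed to anvi'o. In the sketch below, the rule for a "good" group name (letters, digits, and underscores only) is an assumption that happens to agree with the four examples above; anvi'o may accept or reject other names. The first-column header `item` and the genome names are hypothetical.

```python
import csv
import io
import re

# Assumption: "single words without any fancy characters" is read here as
# letters, digits, and underscores only.
GOOD_GROUP_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_good_group_name(name):
    """Return True if the whole name consists of letters, digits, underscores."""
    return GOOD_GROUP_NAME.fullmatch(name) is not None


def write_groups_txt(handle, groups):
    """Write item-to-group assignments as a TAB-delimited table.

    `groups` maps each item name to exactly one group name.
    """
    bad = sorted({g for g in groups.values() if not is_good_group_name(g)})
    if bad:
        raise ValueError(f"group names with unsupported characters: {bad}")
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # "item" is a hypothetical header; only the 'group' header is specified above.
    writer.writerow(["item", "group"])
    for item, group in groups.items():
        writer.writerow([item, group])


buffer = io.StringIO()
write_groups_txt(buffer, {"genome_A": "HIGH_TEMPERATURE", "genome_B": "LOW_FITNESS"})
print(buffer.getvalue(), end="")
```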

FUNCTION ANNOTATION SOURCE: Here you tell anvi'o which function annotation source to use. Obviously, any given function annotation source will have to be common to all genomes. To see what you have in any of the contigs databases, you can use the program anvi-db-info. YOU'RE ALMOST THERE.

  --annotation-source SOURCE NAME
                        Get functional annotations for a specific annotation
                        source. You can use the flag '--list-annotation-
                        sources' to learn about what sources are available.
                        (default: None)
  -l, --list-annotation-sources
                        List available functional annotation sources.
                        (default: False)

OUTPUT: What comes out the other end. The only thing mandatory here is the output file.

  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
  --include-ungrouped   Use this flag if you want anvi'o to include
                        genomes/samples with no group in the analysis. (For
                        pangenomes, this means the genome has no value set for
                        the category variable which you specified using
                        --category-variable. For modules, this means the
                        sample has no group specified in the groups-txt file.
                        And for regular 'ol genomes, this means the genome has
                        nothing in the 'group' column of the input file). By
                        default all variables with no value will be ignored,
                        but if you apply this flag, they will instead be
                        considered as a single group (called 'UNGROUPED') when
                        performing the statistical analysis. (default: False)
  -F FILE, --functional-occurrence-table-output FILE
                        Saves the occurrence frequency information for
                        functions in genomes in a TAB-delimited format. A file
                        name must be provided. To learn more about how the
                        functional occurrence is computed, please refer to the
                        tutorial. (default: None)

OPTIONAL THINGIES: If you want it, here it is, come and get it.

  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)
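
When this program is driven from a script, the flags documented above can be assembled into an argument list. The following Python sketch only builds and prints the command and does not run it. All file names are hypothetical, `build_enrichment_command` is not part of anvi'o, and the check that at least one source of genomes is present is an assumption drawn from the SOURCES OF GENOMES description.

```python
import shlex


def build_enrichment_command(groups_txt, output_file, external_genomes=None,
                             internal_genomes=None, genomes_storage=None,
                             annotation_source=None, include_ungrouped=False):
    """Assemble an argument list using only the flags documented above."""
    sources = [("--external-genomes", external_genomes),
               ("--internal-genomes", internal_genomes),
               ("--genomes-storage", genomes_storage)]
    # Assumption: at least one source of genomes has to be given.
    if not any(value for _, value in sources):
        raise ValueError("provide at least one source of genomes")
    command = ["anvi-compute-functional-enrichment-across-genomes"]
    for flag, value in sources:
        if value:
            command += [flag, value]
    command += ["--groups-txt", groups_txt, "--output-file", output_file]
    if annotation_source:
        command += ["--annotation-source", annotation_source]
    if include_ungrouped:
        command.append("--include-ungrouped")
    return command


# All file names below are hypothetical.
command = build_enrichment_command("groups.txt", "enrichment.txt",
                                   external_genomes="external-genomes.txt",
                                   annotation_source="COG_FUNCTION")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a system where anvi'o is installed.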

anvi-compute-functional-enrichment-in-pan

A program that computes functional enrichment within a pangenome.

pangenomics functions

Usage

usage: anvi-compute-functional-enrichment-in-pan [-h] -p PAN_DB
                                                 [-g GENOMES_STORAGE]
                                                 [--category-variable CATEGORY]
                                                 [--annotation-source SOURCE NAME]
                                                 [--include-gc-identity-as-function]
                                                 [-l] -o FILE_PATH [-F FILE]
                                                 [--just-do-it]

Parameters

CRITICAL INPUTS: Give anvi'o what it wants.

  -p PAN_DB, --pan-db PAN_DB
                        Anvi'o pan database (default: None)
  -g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
                        Anvi'o genomes storage file (default: None)
  --category-variable CATEGORY
                        The additional layers data variable name that divides
                        layers into multiple categories. (default: None)

FUNCTION ANNOTATION SOURCE: Here you tell anvi'o which function annotation source to use.

  --annotation-source SOURCE NAME
                        Get functional annotations for a specific annotation
                        source. You can use the flag '--list-annotation-
                        sources' to learn about what sources are available.
                        (default: None)
  --include-gc-identity-as-function
                        This is an option that asks anvi'o to treat gene
                        cluster names as functions. By doing so, you are in
                        fact creating an opportunity to study functional
                        enrichment statistics for each gene cluster
                        independently. For instance, multiple gene clusters
                        may have the same COG function. But if you wish to use
                        the same enrichment analysis in your pangenome without
                        collapsing multiple gene clusters into a single
                        function name, you can use this flag, and ask for
                        'IDENTITY' as the functional annotation source.
                        (default: False)
  -l, --list-annotation-sources
                        List available functional annotation sources.
                        (default: False)

OUTPUT OPTIONS: What comes out the other end. (Please provide at least the output file name.)

  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)

OUTPUT OPTIONS FOR FUNCTIONAL ENRICHMENT: Reporting options that only make sense for input option #1 or #3.

  -F FILE, --functional-occurrence-table-output FILE
                        Saves the occurrence frequency information for
                        functions in genomes in a TAB-delimited format. A file
                        name must be provided. To learn more about how the
                        functional occurrence is computed, please refer to the
                        tutorial. (default: None)

OPTIONAL THINGIES: If you want it, here it is, come and get it.

  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)

anvi-compute-gene-cluster-homogeneity

Compute homogeneity for gene clusters

Usage

usage: anvi-compute-gene-cluster-homogeneity [-h] -p PAN_DB
                                             [-g GENOMES_STORAGE]
                                             [-o FILE_PATH] [--store-in-db]
                                             [--gene-cluster-id GENE_CLUSTER_ID]
                                             [--gene-cluster-ids-file FILE_PATH]
                                             [-C COLLECTION_NAME]
                                             [-b BIN_NAME]
                                             [--quick-homogeneity]
                                             [-T NUM_THREADS] [--just-do-it]

Parameters

INPUT FILES: Input files from the pangenome analysis.

  -p PAN_DB, --pan-db PAN_DB
                        Anvi'o pan database (default: None)
  -g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
                        Anvi'o genomes storage file (default: None)

REPORTING: How do you want results to be reported? Anvi'o can produce a TAB-delimited output file for you (for which you would have to provide an output file name). Or the results can be stored in the pan database directly, for which you would have to explicitly ask for it. You can get both as well in case you are a fan of redundancy and poor data analysis practices. Anvi'o does not judge.

  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
  --store-in-db         Store analysis results into the database directly.
                        (default: False)

SELECTION: Which gene clusters should be analyzed. You can ask for a single gene cluster, or multiple ones listed in a file, or you can use a collection and bin name to list gene clusters of interest.

  --gene-cluster-id GENE_CLUSTER_ID
                        Gene cluster ID you are interested in. (default: None)
  --gene-cluster-ids-file FILE_PATH
                        Text file for gene clusters (each line should contain
                        a unique gene cluster id). (default: None)
  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name. (default: None)
  -b BIN_NAME, --bin-id BIN_NAME
                        Bin name you are interested in. (default: None)

OPTIONAL: Optional stuff available for you to use

  --quick-homogeneity   By default, anvi'o will use a homogeneity algorithm
                        that checks for horizontal and vertical geometric
                        homogeneity (along with functional). With this flag,
                        you can tell anvi'o to skip horizontal geometric
                        homogeneity calculations. It will be less accurate but
                        quicker. (default: False)
  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: 1)
  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)
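
The advice on --num-threads (do not exceed the number of CPUs or cores on the system) can be enforced in a wrapper script before the value is passed on. A minimal sketch, where `safe_num_threads` is a hypothetical helper and the CPU count reported by Python is taken as the upper limit:

```python
import os


def safe_num_threads(requested):
    """Clamp a requested thread count to the range 1..number of CPUs."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(requested, available))


print(safe_num_threads(4))
```

On a shared cluster the scheduler's allocation, not the machine's total core count, is the more relevant limit, so this check is only a first line of defense.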

anvi-compute-genome-similarity

Export sequences from sequence sources and compute a similarity metric (e.g. ANI). If a pan database is given, anvi'o will write the computed output to the misc data tables of the pan database.

ani dereplication redundancy

Usage

usage: anvi-compute-genome-similarity [-h] [-i FILE_PATH] [-e FILE_PATH]
                                      [-f FASTA_TEXT_FILE] -o DIR_PATH
                                      [-p PAN_DB]
                                      [--program {pyANI,fastANI,sourmash}]
                                      [--fastani-kmer-size FASTANI_KMER_SIZE]
                                      [--fragment-length FRAGMENT_LENGTH]
                                      [--min-num-fragments MIN_NUM_FRAGMENTS]
                                      [--method {ANIm,ANIb,ANIblastall,TETRA}]
                                      [--min-alignment-fraction NUM]
                                      [--significant-alignment-length INT]
                                      [--min-full-percent-identity FULL_PERCENT_IDENTITY]
                                      [--kmer-size INT] [--scale INT]
                                      [--distance DISTANCE_METRIC]
                                      [--linkage LINKAGE_METHOD]
                                      [-T NUM_THREADS] [--just-do-it]
                                      [--log-file FILE_PATH]

Parameters

INPUT OPTIONS: Tell anvi'o what you want.

  -i FILE_PATH, --internal-genomes FILE_PATH
                        A five-column TAB-delimited flat text file. The header
                        line must contain these columns: 'name', 'bin_id',
                        'collection_id', 'profile_db_path', 'contigs_db_path'.
                        Each line should list a single entry, where 'name' can
                        be any name to describe the anvi'o bin identified as
                        'bin_id' that is stored in a collection. (default:
                        None)
  -e FILE_PATH, --external-genomes FILE_PATH
                        A two-column TAB-delimited flat text file that lists
                        anvi'o contigs databases. The first item in the header
                        line should read 'name', and the second should read
                        'contigs_db_path'. Each line in the file should
                        describe a single entry, where the first column is the
                        name of the genome (or MAG), and the second column is
                        the anvi'o contigs database generated for this genome.
                        (default: None)
  -f FASTA_TEXT_FILE, --fasta-text-file FASTA_TEXT_FILE
                        A two-column TAB-delimited file that lists multiple
                        FASTA files to import for analysis. If using for
                        `anvi-dereplicate-genomes` or `anvi-compute-distance`,
                        each FASTA is assumed to be a genome. The first item
                        in the header line should read 'name', and the second
                        item should read 'path'. Each line in the field should
                        describe a single entry, where the first column is the
                        name of the FASTA file or corresponding sequence, and
                        the second column is the path to the FASTA file
                        itself. (default: None)

OUTPUT OPTIONS: Tell anvi'o where to store your results.

  -o DIR_PATH, --output-dir DIR_PATH
                        Directory path for output files (default: None)
  -p PAN_DB, --pan-db PAN_DB
                        This is totally optional, but very useful when
                        applicable. If you are running this for genomes for
                        which you already have an anvi'o pangenome, then you
                        can show where the pan database is and anvi'o would
                        automatically add the results into the misc data
                        tables of your pangenome. Those data can then be shown
                        as heatmaps on the pan interactive interface through
                        the 'layers' tab. (default: None)

Program: Tell anvi'o which similarity program to run.

  --program {pyANI,fastANI,sourmash}
                        Tell anvi'o which program to run to process genome
                        similarity. For ANI, you should either use pyANI or
                        fastANI. If accuracy is paramount (for example,
                        distinguishing things less than 1 percent different),
                        or for dealing with genomes < 80 percent similar,
                        pyANI is what we recommend. However, fastANI is much
                        faster. If you for some reason want to use mash
                        similarity, you can use sourmash, but it's really not
                        intended for genome comparisons. If you don't choose
                        anything here, anvi'o will reluctantly set the program
                        to pyANI, but you really should be the one who is on
                        top of these things. (default: pyANI)

fastANI Settings: Tell anvi'o to tell fastANI what settings to set. Only if --program is set to fastANI

  --fastani-kmer-size FASTANI_KMER_SIZE
                        Choose a kmer. The default is 16.
  --fragment-length FRAGMENT_LENGTH
                        Choose a fragment length. The default is 3000.
  --min-num-fragments MIN_NUM_FRAGMENTS
                        Choose the minimum number of fragment lengths that can
                        be trusted. The default is 50.

pyANI Settings: Tell anvi'o to tell pyANI what method you wish to use and what settings to set. Only if --program is set to pyANI

  --method {ANIm,ANIb,ANIblastall,TETRA}
                        Method for pyANI. The default is ANIb. You must have
                        the necessary binary in path for whichever method you
                        choose. According to the pyANI help for v0.2.7 at
                        https://github.com/widdowquinn/pyani, the method
                        'ANIm' uses MUMmer (NUCmer) to align the input
                        sequences. 'ANIb' uses BLASTN+ to align 1020nt
                        fragments of the input sequences. 'ANIblastall' uses
                        the legacy BLASTN to align 1020nt fragments. Finally,
                        'TETRA' calculates tetranucleotide frequencies of
                        each input sequence.
  --min-alignment-fraction NUM
                        In some cases you may get high raw ANI estimates
                        (percent identity scores) between two genomes that
                        have little to do with each other simply because only
                        a small fraction of their content may be aligned. This
                        filter will set all ANI scores between two genomes to
                        0 if the alignment fraction is less than you deem
                        trustable. When you set a value, anvi'o will go
                        through the ANI results, and set percent identity
                        scores between two genomes to 0 if the alignment
                        fraction *between either of them* is less than the
                        parameter described here. The default is 0.
  --significant-alignment-length INT
                        So --min-alignment-fraction discards any hit that is
                        coming from alignments that represent shorter
                        fractions of genomes, but what if you still don't want
                        to miss an alignment that is longer than an X number
                        of nucleotides regardless of what fraction of the
                        genome it represents? Well, this parameter is to
                        recover things that may be lost due to --min-
                        alignment-fraction parameter. Let's say, if you set
                        --min-alignment-fraction to '0.05', and this parameter
                        to '5000', anvi'o will keep hits from alignments that
                        are longer than 5000 nts, EVEN IF THEY REPRESENT less
                        than 5 percent of a given genome pair. Basically if
                        --min-alignment-fraction is your shield to protect
                        yourself from incoming garbage, --significant-
                        alignment-length is your chopstick to pick out those
                        that may be interesting, and you are a true warrior
                        here. (default: None)
  --min-full-percent-identity FULL_PERCENT_IDENTITY
                        In some cases you may get high raw ANI estimates
                        (percent identity scores) between two genomes that
                        have little to do with each other simply because only
                        a small fraction of their content may be aligned. This
                        can be partly alleviated by considering the *full*
                        percent identity, which includes in its calculation
                        regions that did not align. For example, if the
                        alignment is a whopping 97 percent identity but only 8
                        percent of the genome aligned, the *full* percent
                        identity is 0.970 * 0.080 = 0.078 OR 7.8 percent.
                        *full* percent identity is always included in the
                        report, but you can also use it as a filter for other
                        metrics, such as percent identity. This filter will
                        set all ANI measures between two genomes to 0 if the
                        *full* percent identity is less than you deem
                        trustable. When you set a value, anvi'o will go
                        through the ANI results, and set all ANI measures
                        between two genomes to 0 if the *full* percent
                        identity *between either of them* is less than the
                        parameter described here. The default is 0.
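
The interplay between --min-alignment-fraction and --significant-alignment-length described above can be sketched in a few lines of Python. This is only an illustration of the logic as the help text describes it: the function name `keep_hit` and its arguments are hypothetical, and the way anvi'o actually applies these filters internally may differ.

```python
def keep_hit(alignment_fraction, alignment_length,
             min_alignment_fraction=0.0, significant_alignment_length=None):
    """Return True if an ANI hit survives both filters (illustrative only)."""
    # The shield: hits covering at least the minimum fraction are kept.
    if alignment_fraction >= min_alignment_fraction:
        return True
    # The chopstick: rescue long alignments regardless of the fraction.
    if significant_alignment_length is not None:
        return alignment_length > significant_alignment_length
    return False


# The example from the help text: fraction threshold 0.05, length 5000 nts.
rescued = keep_hit(0.03, 8000, 0.05, 5000)   # below 5 percent, but long
dropped = keep_hit(0.03, 2000, 0.05, 5000)   # below 5 percent and short
```

Here `rescued` is True and `dropped` is False: the short, low-coverage hit is discarded, while the long one is kept even though it covers less than 5 percent of the genome.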

Sourmash Settings: Tell anvi'o to tell sourmash what settings to set. Only if --program is set to sourmash

  --kmer-size INT       Set the k-mer size for mash similarity checks. We
                        found 13 in almost all cases correlates best with
                        alignment-based ANI. (default: None)
  --scale INT           Set the compression ratio for fasta signature file
                        computations. The default is 1000. Smaller ratios
                        decrease sensitivity, while larger ratios will lead to
                        large fasta signatures. (default: 1000)

HIERARCHICAL CLUSTERING: anvi-compute-genome-similarity outputs similarity matrix files, which can be clustered into dendrograms that display the relationships between genomes (in the anvi'o interface and elsewhere). Here you can set the distance metric and the linkage algorithm for that.

  --distance DISTANCE_METRIC
                        The distance metric for the hierarchical clustering.
                        The default is "euclidean".
  --linkage LINKAGE_METHOD
                        The linkage method for the hierarchical clustering.
                        The default is "ward".

OTHER IMPORTANT STUFF: Yes. You're almost done.

  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: 1)
  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)
  --log-file FILE_PATH  File path to store debug/output messages. (default:
                        None)

anvi-compute-metabolic-enrichment

A program that computes metabolic enrichment across groups of genomes and metagenomes

metabolism

Example uses and other resources

Usage

usage: anvi-compute-metabolic-enrichment [-h] [-M TEXT_FILE] [-G TEXT_FILE] -o
                                         FILE_PATH
                                         [--sample-header SAMPLE_HEADER]
                                         [--module-completion-threshold NUM]
                                         [--include-samples-missing-from-groups-txt]
                                         [--just-do-it]

Parameters

CRITICAL INPUTS: This program requires TWO INPUT FILES: (1) an output file produced by the program anvi-estimate-metabolism as input and (2) a groups file (using the --groups-txt parameter) to specify which sample is in which group.

  -M TEXT_FILE, --modules-txt TEXT_FILE
                        A tab-delimited text file specifying module
                        completeness in every genome/MAG/sample that you are
                        interested in. The best way to get this file is to run
                        `anvi-estimate-metabolism --kegg-output-modes modules`
                        on your samples of interest. Trust us. (default: None)
  -G TEXT_FILE, --groups-txt TEXT_FILE
                        A tab-delimited text file specifying which group each
                        item belongs to. Depending on the context, items here
                        may be individual samples or genomes. The first column
                        must contain item names matching to those that are in
                        your input data. A different column should have the
                        header 'group' and contain the group name for each
                        item. Each item should be associated with a single
                        group. It is always a good idea to define groups using
                        single words without any fancy characters. For
                        instance, `HIGH_TEMPERATURE` or `LOW_FITNESS` are good
                        group names. `my group #1` or `IS-THIS-OK?`, are not
                        good group names. (default: None)
  --sample-header SAMPLE_HEADER
                        The header of the column containing your sample names
                        in the modules-txt input file. By default this is
                        'db_name' because we are assuming you got your modules
                        mode output by running `anvi-estimate-metabolism` in
                        multi mode (on multiple genomes or metagenomes), but
                        just in case you got it a different way, this is how
                        you can tell anvi'o which column to look at. The
                        values in this column should correspond to those in
                        the 'sample' column in the groups-txt input file.
                        (default: db_name)
  --module-completion-threshold NUM
                        This threshold defines the percent completeness score
                        at which we consider a KEGG module to be 'present' in a
                        given sample. That is, if a module's completeness in a
                        sample is less than this value, then we say the module
                        is not present in that sample, and this will affect
                        the module's enrichment score. By extension, if a
                        module's completeness is less than this value in all
                        samples, it will have a very very low enrichment score
                        (ie, it will not be considered enriched at all,
                        because it doesn't occur in any groups). Note that the
                        closer this number is to 0, the more meaningless this
                        whole enrichment analysis is... but hey, this is your
                        show. This threshold CAN be different from the
                        completeness threshold used in `anvi-estimate-
                        metabolism` if you wish. The default threshold is
                        0.75.
  --include-samples-missing-from-groups-txt
                        Sometimes, you might have some sample names in your
                        modules-txt file that you did not include in the
                        groups-txt file. This is fine. By default, we will
                        ignore those samples because they do not have a group.
                        But if you use this flag, then instead those samples
                        will be included in a group called 'UNGROUPED'. Be
                        cautious when using this flag in combination with the
                        --include-ungrouped flag (which also sticks samples
                        without groups into the 'UNGROUPED' group) so that you
                        don't accidentally group together samples that are not
                        supposed to be friends. (default: False)
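
To make the roles of the groups file and --module-completion-threshold more concrete, here is a minimal Python sketch of the presence/absence step described above. It does not compute the enrichment statistic itself, and it is not anvi'o code: the function `count_module_presence`, the sample names, and the data layout are assumptions made for illustration.

```python
from collections import defaultdict


def count_module_presence(completeness, groups, threshold=0.75,
                          include_ungrouped=False):
    """Count, per group, the samples in which each module is 'present'.

    completeness: {sample: {module: completeness score between 0 and 1}}
    groups:       {sample: group name}, as one might read from groups-txt
    """
    counts = defaultdict(lambda: defaultdict(int))
    for sample, modules in completeness.items():
        group = groups.get(sample)
        if group is None:
            if not include_ungrouped:
                continue  # default behavior: ignore samples without a group
            group = "UNGROUPED"
        for module, score in modules.items():
            # A module below the threshold counts as absent from the sample.
            if score >= threshold:
                counts[module][group] += 1
    return {module: dict(per_group) for module, per_group in counts.items()}


# Hypothetical samples and module completeness scores.
completeness = {
    "S1": {"M00001": 0.90, "M00002": 0.50},
    "S2": {"M00001": 0.80, "M00002": 0.75},
    "S3": {"M00001": 1.00},
}
groups = {"S1": "HIGH_TEMPERATURE", "S2": "LOW_TEMPERATURE"}

default_counts = count_module_presence(completeness, groups)
with_ungrouped = count_module_presence(completeness, groups,
                                       include_ungrouped=True)
```

With the default settings, sample S3 is ignored because it has no group. Passing `include_ungrouped=True` mimics --include-samples-missing-from-groups-txt and places it in 'UNGROUPED'.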

OUTPUT OPTIONS: What comes out the other end. (Please provide at least the output file name.)

  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)

OPTIONAL OPTIONS: If you want it, here it is, come and get it.

  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)

anvi-db-info

Access self tables, display values, or set new ones entirely at your own risk

Usage

usage: anvi-db-info [-h] [--self-key SELF_KEY] [--self-value SELF_VALUE]
                    [--just-do-it]
                    DATABASE_PATH

Parameters

Input: The database path you wish to access.

  DATABASE_PATH         An anvi'o database for pan, profile, contigs, or
                        auxiliary data

Very dangerous zone: For power users with extreme self-control and maturity.

  --self-key SELF_KEY   The key you wish to set or change. (default: None)
  --self-value SELF_VALUE
                        The value you wish to set for the self key. (default:
                        None)
  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)

anvi-delete-collection

Remove a collection from a given profile database

Usage

usage: anvi-delete-collection [-h] -p PROFILE_DB [-C COLLECTION_NAME]
                              [--list-collections]

Parameters

optional arguments:

  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name. (default: None)
  --list-collections    Show available collections and exit. (default: False)

anvi-delete-functions

Remove functional annotation sources from an anvi'o contigs database

Usage

usage: anvi-delete-functions [-h] -c CONTIGS_DB
                             [--annotation-sources SOURCE NAME[S]] [-l]

Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  --annotation-sources SOURCE NAME[S]
                        One or more functional annotations sources to drop. If
                        you wish to remove more than one, separate them from
                        each other using a comma character without a space.
                        For example: 'SOURCE_1,SOURCE_2,SOURCE_3'. (default:
                        None)
  -l, --list-annotation-sources
                        List available functional annotation sources.
                        (default: False)

anvi-delete-hmms

Remove HMM hits from an anvi'o contigs database

Usage

usage: anvi-delete-hmms [-h] -c CONTIGS_DB [--hmm-source SOURCE NAME] [-l]
                        [--just-do-it]

Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  --hmm-source SOURCE NAME
                        Use a specific HMM source. You can use '--list-hmm-
                        sources' flag to see a list of available resources.
                        The default is 'None'.
  -l, --list-hmm-sources
                        List available HMM sources in the contigs database and
                        quit. (default: False)
  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)

anvi-delete-misc-data

Remove stuff from 'additional data' or 'order' tables for either items or layers in either pan or profile databases. OR, remove stuff from the 'additional data' tables for nucleotides or amino acids in contigs databases

Example uses and other resources

Usage

usage: anvi-delete-misc-data [-h] [-p PAN_OR_PROFILE_DB] [-c CONTIGS_DB] -t
                             NAME [--keys-to-remove KEYS_TO_REMOVE]
                             [--groups-to-remove GROUPS_TO_REMOVE]
                             [--list-available-keys] [--just-do-it]

Parameters

Database input: Provide 1 of these

  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
                        Anvi'o pan or profile database (and even genes
                        database in appropriate contexts). (default: None)
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)

Details: Everything else.

  -t NAME, --target-data-table NAME
                        The target table is the table you are interested in
                        accessing. Currently it can be 'items', 'layers', or
                        'layer_orders'. Please see most up-to-date online
                        documentation for more information. (default: None)
  --keys-to-remove KEYS_TO_REMOVE
                        A comma-separated list of data keys to remove from the
                        database. If you do not use this parameter, anvi'o
                        will simply remove everything from the target data
                        table immediately. Please note that you should not use
                        this parameter together with `--groups-to-remove` in a
                        single command. (default: None)
  --groups-to-remove GROUPS_TO_REMOVE
                        A comma-separated list of data groups to remove from
                        the database. If you do not use this parameter, anvi'o
                        will simply remove everything from the target data
                        table immediately. Please note that you should not use
                        this parameter together with `--keys-to-remove` in a
                        single command. (default: None)
  --list-available-keys
                        Using this flag will list available data keys in the
                        target data table and quit without doing anything
                        else. (default: False)
  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)

anvi-delete-state

Delete an anvi'o state from a pan or profile database

Usage

usage: anvi-delete-state [-h] -p PAN_OR_PROFILE_DB [-s STATE_NAME]
                         [--list-states]

Parameters

optional arguments:

  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
                        Anvi'o pan or profile database (and even genes
                        database in appropriate contexts). (default: None)
  -s STATE_NAME, --state STATE_NAME
                        The state name to ... delete :( (default: None)
  --list-states         Show available states and exit. (default: False)

anvi-dereplicate-genomes

Identify redundant (highly similar) genomes

Usage

usage: anvi-dereplicate-genomes [-h] [-i FILE_PATH] [-e FILE_PATH]
                                [-f FASTA_TEXT_FILE] [--ani-dir PATH]
                                [--mash-dir PATH] -o DIR_PATH
                                [--skip-fasta-report] [--report-all]
                                [--program {pyANI,fastANI,sourmash}]
                                [--fastani-kmer-size FASTANI_KMER_SIZE]
                                [--fragment-length FRAGMENT_LENGTH]
                                [--min-fraction MIN_FRACTION]
                                [--method {ANIm,ANIb,ANIblastall,TETRA}]
                                [--min-alignment-fraction NUM]
                                [--significant-alignment-length INT]
                                [--use-full-percent-identity]
                                [--min-full-percent-identity FULL_PERCENT_IDENTITY]
                                [--kmer-size INT] [--scale INT]
                                --similarity-threshold SIMILARITY_THRESHOLD
                                [--cluster-method {simple_greedy}]
                                [--representative-method {Qscore,length,centrality}]
                                [-T NUM_THREADS] [--just-do-it]
                                [--skip-checking-genome-hashes]
                                [--log-file FILE_PATH]

Parameters

INPUT OPTIONS: Tell anvi'o what you want.

  -i FILE_PATH, --internal-genomes FILE_PATH
                        A five-column TAB-delimited flat text file. The header
                        line must contain these columns: 'name', 'bin_id',
                        'collection_id', 'profile_db_path', 'contigs_db_path'.
                        Each line should list a single entry, where 'name' can
                        be any name to describe the anvi'o bin identified as
                        'bin_id' that is stored in a collection. (default:
                        None)
  -e FILE_PATH, --external-genomes FILE_PATH
                        A two-column TAB-delimited flat text file that lists
                        anvi'o contigs databases. The first item in the header
                        line should read 'name', and the second should read
                        'contigs_db_path'. Each line in the file should
                        describe a single entry, where the first column is the
                        name of the genome (or MAG), and the second column is
                        the anvi'o contigs database generated for this genome.
                        (default: None)
  -f FASTA_TEXT_FILE, --fasta-text-file FASTA_TEXT_FILE
                        A two-column TAB-delimited file that lists multiple
                        FASTA files to import for analysis. If using for
                        `anvi-dereplicate-genomes` or `anvi-compute-distance`,
                        each FASTA is assumed to be a genome. The first item
                        in the header line should read 'name', and the second
                        item should read 'path'. Each line in the file should
                        describe a single entry, where the first column is the
                        name of the FASTA file or corresponding sequence, and
                        the second column is the path to the FASTA file
                        itself. (default: None)
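
The two-column input files described above are simple enough to generate programmatically. The sketch below builds the text of an external-genomes file and a fasta-text file with Python's `csv` module. The helper `two_column_table` and the genome names and paths are hypothetical. Only the header names ('name', 'contigs_db_path', 'path') come from the help text.

```python
import csv
import io


def two_column_table(entries, second_header):
    """Return TAB-delimited text with a 'name' column and one path column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["name", second_header])
    writer.writerows(entries)
    return buffer.getvalue()


# Hypothetical genome names and file paths, for illustration only.
external_genomes = two_column_table(
    [("Genome_A", "Genome_A.db"), ("Genome_B", "Genome_B.db")],
    second_header="contigs_db_path",
)
fasta_txt = two_column_table(
    [("Genome_A", "Genome_A.fa"), ("Genome_B", "Genome_B.fa")],
    second_header="path",
)
```

Writing either string to disk (for example with `open("external-genomes.txt", "w").write(external_genomes)`) gives a file that follows the layout the parameters above ask for.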

IMPORT RESULTS: Alternatively, if you have previous ANI or mash similarity computations on your genomes, you can import the result directory here to use. Please note that file names must remain unchanged for anvi'o to find them

  --ani-dir PATH        You can import the directory created by `anvi-compute-
                        genome-similarity` if `--program` parameter was set to
                        `fastANI` or `pyANI` and use it for dereplication
                        (default: None)
  --mash-dir PATH       You can import the directory created by `anvi-compute-
                        genome-similarity` if `--program` parameter was set to
                        `sourmash` and use it for dereplication (default:
                        None)

OUTPUT OPTIONS: Tell anvi'o where to store your results.

  -o DIR_PATH, --output-dir DIR_PATH
                        Directory path for output files (default: None)
  --skip-fasta-report   By default, if any sequence source is provided, FASTA
                        files of non-redundant genomes are reported. With this
                        flag, no FASTA files are reported. (default: False)
  --report-all          By default, only FASTA files of non-redundant genomes
                        are reported, i.e. single representatives from each
                        cluster. With this flag, all genome FASTAS will be
                        reported. (default: False)

Program: Tell anvi'o which similarity program to run.

  --program {pyANI,fastANI,sourmash}
                        Tell anvi'o which program to run to process genome
                        similarity. For ANI, you can either use pyANI or
                        fastANI. If accuracy is paramount (for example,
                        distinguishing things less than 1 percent different),
                        or for dealing with genomes < 80 percent similar,
                        pyANI is what we recommend. However, fastANI is much
                        faster. If you for some reason want to use mash
                        similarity, you can use sourmash, but it's really not
                        intended for genome comparisons. (default: None)

fastANI Settings: Tell anvi'o to tell fastANI what settings to set. Only if --program is set to fastANI

  --fastani-kmer-size FASTANI_KMER_SIZE
                        Choose a kmer. The default is 16.
  --fragment-length FRAGMENT_LENGTH
                        Choose a fragment length. The default is 3000.
  --min-fraction MIN_FRACTION
                        Minimum fraction of alignment to be shared between
                        genome pairs to calculate ANI. If reference and query
                        genome size differ, smaller one among the two is
                        considered. The default is 0.25.

pyANI Settings: Tell anvi'o to tell pyANI what method you wish to use and what settings to set. Only if --program is set to pyANI

  --method {ANIm,ANIb,ANIblastall,TETRA}
                        Method for pyANI. The default is ANIb. You must have
                        the necessary binary in path for whichever method you
                        choose. According to the pyANI help for v0.2.7 at
                        https://github.com/widdowquinn/pyani, the method
                        'ANIm' uses MUMmer (NUCmer) to align the input
                        sequences. 'ANIb' uses BLASTN+ to align 1020nt
                        fragments of the input sequences. 'ANIblastall' uses
                        the legacy BLASTN to align 1020nt fragments. Finally,
                        'TETRA' calculates tetranucleotide frequencies of
                        each input sequence.
  --min-alignment-fraction NUM
                        In some cases you may get high raw ANI estimates
                        (percent identity scores) between two genomes that
                        have little to do with each other simply because only
                        a small fraction of their content may be aligned. This
                        filter will set all ANI scores between two genomes to
                        0 if the alignment fraction is less than you deem
                        trustable. When you set a value, anvi'o will go
                        through the ANI results, and set percent identity
                        scores between two genomes to 0 if the alignment
                        fraction *between either of them* is less than the
                        parameter described here. The default is 0.25.
  --significant-alignment-length INT
                        So --min-alignment-fraction discards any hit that is
                        coming from alignments that represent shorter
                        fractions of genomes, but what if you still don't want
                        to miss an alignment that is longer than an X number
                        of nucleotides regardless of what fraction of the
                        genome it represents? Well, this parameter is to
                        recover things that may be lost due to --min-
                        alignment-fraction parameter. Let's say, if you set
                        --min-alignment-fraction to '0.05', and this parameter
                        to '5000', anvi'o will keep hits from alignments that
                        are longer than 5000 nts, EVEN IF THEY REPRESENT less
                        than 5 percent of a given genome pair. Basically if
                        --min-alignment-fraction is your shield to protect
                        yourself from incoming garbage, --significant-
                        alignment-length is your chopstick to pick out those
                        that may be interesting, and you are a true warrior
                        here. (default: None)
  --use-full-percent-identity
                        Usually, percent identity is calculated only over
                        aligned regions, and this is what is used as a
                        distance metric by default. But with this flag, you
                        can instead use the *full* percent identity as the
                        distance metric. It is the same as percent identity,
                        except that regions that did not align are included in
                        the calculation. This means *full* percent identity
                        will always be less than or equal to percent identity.
                        How is it calculated? Well, if P is the percentage
                        identity calculated in aligned regions, L is the
                        length of the genome, and A is the length of the
                        genome that aligned to a compared genome, the full
                        percent identity is P * (A/L). In other words, it is
                        the percent identity multiplied by the alignment
                        coverage. For example, if the alignment is a whopping
                        97 percent identity but only 8 percent of the genome
                        aligned, the *full* percent identity is 0.970 * 0.080
                        = 0.078, which is just 7.8 percent. (default: False)
  --min-full-percent-identity FULL_PERCENT_IDENTITY
                        In some cases you may get high raw ANI estimates
                        (percent identity scores) between two genomes that
                        have little to do with each other simply because only
                        a small fraction of their content may be aligned. This
                        can be partly alleviated by considering the *full*
                        percent identity, which includes in its calculation
                        regions that did not align. For example, if the
                        alignment is a whopping 97 percent identity but only 8
                        percent of the genome aligned, the *full* percent
                        identity is 0.970 * 0.080 = 0.078 OR 7.8 percent.
                        *full* percent identity is always included in the
                        report, but you can also use it as a filter for other
                        metrics, such as percent identity. This filter will
                        set all ANI measures between two genomes to 0 if the
                        *full* percent identity is less than you deem
                        trustable. When you set a value, anvi'o will go
                        through the ANI results, and set all ANI measures
                        between two genomes to 0 if the *full* percent
                        identity *between either of them* is less than the
                        parameter described here. The default is 20.
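
The *full* percent identity arithmetic used in the descriptions above can be reproduced with a short Python sketch. The function name and the genome length are made up for illustration. The calculation is simply the percent identity of the aligned regions multiplied by the alignment coverage.

```python
def full_percent_identity(percent_identity, aligned_length, genome_length):
    """Scale the identity of aligned regions by the alignment coverage."""
    coverage = aligned_length / genome_length
    return percent_identity * coverage


# 97 percent identity, but only 8 percent of a 1,000,000 nt genome aligned.
fpi = full_percent_identity(0.97, 80_000, 1_000_000)
print(round(fpi, 3))  # 0.078
```

Because the coverage can never exceed 1, the result is always less than or equal to the percent identity of the aligned regions.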

sourmash settings: Tell anvi'o to run sourmash with specific settings. Only if --program is set to sourmash

  --kmer-size INT       Set the k-mer size for mash similarity checks. The
                        default is 13.
  --scale INT           Set the compression ratio for fasta signature file
                        computations. The default is 1000. Smaller ratios
                        decrease sensitivity, while larger ratios will lead to
                        large fasta signatures. (default: 1000)

Dereplication Parameters: Some parameters to guide your dereplication

  --similarity-threshold SIMILARITY_THRESHOLD
                        If two genomes have a similarity greater than or equal
                        to this threshold, they will belong to the same
                        cluster. Since measures of 'similarity' depend
                        strongly on what method is used for calculation, and
                        since the threshold at which two genomes should be
                        considered 'similar enough' to be considered redundant
                        will depend on the application, anvi'o refuses to
                        provide a default parameter. If you're using pyANI,
                        maybe 0.90 is what you're after. If you're using
                        sourmash, maybe 0.25 is what you're after. Or maybe
                        not? Anvi'o is feeling nervous about this decision.
                        (default: None)
  --cluster-method {simple_greedy}
                        Currently, genomes are clustered based on a simple
                        greedy algorithm. Let's say your similarity threshold
                        is 0.90. If genome A is 0.95 similar to B, and B is
                        0.95 similar to C, and C is 0.95 similar to D, then
                        {A,B,C,D} will form a cluster. This is *even though* D
                        may share a similarity to A of merely 0.80, which is
                        below the similarity threshold. You want better
                        alternatives? Contact the developers with your ideas.
                        (default: simple_greedy)
  --representative-method {Qscore,length,centrality}
                        After genomes are grouped into redundancy clusters,
                        you can define how anvi'o picks the representative
                        genome from the cluster. 'Qscore' computes the genome
                        with the highest completion and lowest redundancy as
                        the representative. 'length' returns the longest
                        genome. 'centrality' returns the genome with the
                        highest average similarity to everything in the
                        cluster, i.e. the most central. The default is
                        centrality.
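
The chaining behavior described for --cluster-method and the 'centrality' option of --representative-method can be illustrated with a small Python sketch. Note that this is not anvi'o's implementation: it uses a union-find structure to reproduce the transitive grouping the help text describes, and the function names, genome names, and similarity values are all assumptions.

```python
def cluster_genomes(genomes, similarity, threshold):
    """Group genomes so any pair at or above the threshold shares a cluster.

    similarity maps frozenset({a, b}) to a similarity score.
    """
    parent = {g: g for g in genomes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    for pair, score in similarity.items():
        if score >= threshold:
            a, b = pair
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for g in genomes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in clusters.values())


def most_central(cluster, similarity):
    """Pick the genome with the highest average similarity to the others."""
    def average(g):
        others = [o for o in cluster if o != g]
        if not others:
            return 0.0
        return sum(similarity[frozenset((g, o))] for o in others) / len(others)
    return max(cluster, key=average)


# Hypothetical similarities: A-B, B-C and C-D chain together, A-D is only 0.80.
pairs = {
    ("A", "B"): 0.95, ("A", "C"): 0.92, ("A", "D"): 0.80,
    ("B", "C"): 0.95, ("B", "D"): 0.88, ("C", "D"): 0.95,
    ("A", "E"): 0.50, ("B", "E"): 0.50, ("C", "E"): 0.50, ("D", "E"): 0.50,
}
similarity = {frozenset(pair): score for pair, score in pairs.items()}

clusters = cluster_genomes("ABCDE", similarity, threshold=0.90)
representative = most_central(clusters[0], similarity)
```

With these made-up numbers, A, B, C and D end up in one cluster even though A and D are only 0.80 similar, E stays on its own, and C is picked as the most central genome of the larger cluster.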

OTHER IMPORTANT STUFF: Yes. You're almost done.

  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: 1)
  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)
  --skip-checking-genome-hashes
                        Use this flag if you would like anvi'o to skip
                        checking genome hashes. This is only relevant if you
                        may have genomes in your internal or external genomes
                        files that have identical sequences with different
                        names AND if you are OK with it. You may be OK with
                        it, for instance, if you are using `anvi-dereplicate-
                        genomes` program to dereplicate genomes described in
                        multiple collections in an anvi'o profile database
                        that may be describing the same genome multiple times
                        (see https://github.com/merenlab/anvio/issues/1397 for
                        a case). (default: False)
  --log-file FILE_PATH  File path to store debug/output messages. (default:
                        None)
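
For context on --skip-checking-genome-hashes, the following Python sketch shows one way identical sequences stored under different names could be detected. It is an assumption-laden illustration: SHA-256 over the raw sequence string is used here for simplicity, which is not necessarily how anvi'o computes genome hashes, and the function and genome names are hypothetical.

```python
import hashlib
from collections import defaultdict


def find_identical_genomes(sequences):
    """Return groups of genome names whose sequences are exactly identical."""
    by_hash = defaultdict(list)
    for name, sequence in sequences.items():
        digest = hashlib.sha256(sequence.encode("ascii")).hexdigest()
        by_hash[digest].append(name)
    # Only hashes shared by more than one name indicate duplicates.
    return sorted(sorted(names) for names in by_hash.values() if len(names) > 1)


# Hypothetical genomes: MAG_1 and MAG_2 carry the same sequence.
duplicates = find_identical_genomes(
    {"MAG_1": "ATGCATGC", "MAG_2": "ATGCATGC", "MAG_3": "ATGCATGG"}
)
```

Here `duplicates` contains a single group, `["MAG_1", "MAG_2"]`, which is the kind of situation the flag above lets you accept deliberately.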

anvi-display-contigs-stats

Start the anvi'o interactive interface for viewing or comparing contigs statistics

Usage

usage: anvi-display-contigs-stats [-h] [--report-as-text] [-o FILE_PATH]
                                  [--dry-run] [-I IP_ADDR] [-P INT]
                                  [--browser-path PATH] [--server-only]
                                  [--password-protected]
                                  CONTIG DATABASE(S) [CONTIG DATABASE(S) ...]

Parameters

positional arguments:

  CONTIG DATABASE(S)    Anvi'o contigs databases to display statistics for. You
                        can give multiple databases by separating them with
                        spaces.

optional arguments:

  -h, --help            show this help message and exit

REPORT CONFIGURATION: Specify what kind of output you want.

  --report-as-text      If you give this flag, anvi'o will not open a new
                        browser window to show contigs database statistics.
                        Instead, it will write all stats to a TAB-separated
                        file. You must also give --output-file with this flag;
                        otherwise anvi'o will complain. (default: False)
  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
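
The relationship between --report-as-text and --output-file can be sketched as a small command builder. This is a minimal sketch that uses only the flags listed above; the function name and the database file names are hypothetical, and the early check simply mirrors the help text rather than anvi'o's own code.

```python
import shlex

def contigs_stats_command(contigs_dbs, report_as_text=False, output_file=None):
    """Assemble an anvi-display-contigs-stats command line from the documented flags."""
    if not contigs_dbs:
        raise ValueError("at least one contigs database is required")
    if report_as_text and output_file is None:
        # The help text says anvi'o will complain in this case, so fail early here.
        raise ValueError("--report-as-text needs --output-file")
    args = ["anvi-display-contigs-stats", *contigs_dbs]
    if report_as_text:
        args.append("--report-as-text")
    if output_file is not None:
        args += ["--output-file", output_file]
    return shlex.join(args)

# Hypothetical database names, for illustration only.
command = contigs_stats_command(["A.db", "B.db"], report_as_text=True, output_file="stats.txt")
print(command)
```

The printed string can be pasted into a terminal where anvi'o is installed.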

SERVER CONFIGURATION: For power users.

  --dry-run             Don't do anything real. Test everything, and stop
                        right before wherever the developer said 'well, this
                        is enough testing', and decided to print out results.
                        (default: False)
  -I IP_ADDR, --ip-address IP_ADDR
                        IP address for the HTTP server. The default ip address
                        (0.0.0.0) should work just fine for most.
  -P INT, --port-number INT
                        Port number to use for anvi'o services. If nothing is
                        declared, anvi'o will try to find a suitable port
                        number, starting from the default port number, 8080.
                        (default: None)
  --browser-path PATH   By default, anvi'o will use your default browser to
                        launch the interactive interface. If you would like to
                        use something else than your system default, you can
                        provide a full path for an alternative browser using
                        this parameter, and hope for the best. For instance we
                        are using this parameter to call Google's experimental
                        browser, Canary, which performs better with demanding
                        visualizations. (default: None)
  --server-only         The default behavior is to start the local server, and
                        fire up a browser that connects to the server. If you
                        have other plans, and want to start the server without
                        calling the browser, this is the flag you need.
                        (default: False)
  --password-protected  If this flag is set, command line tool will ask you to
                        enter a password and interactive interface will be
                        only accessible after entering same password. This
                        option is recommended for shared machines like
                        clusters or shared networks where computers are not
                        isolated. (default: False)
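
The port search described for --port-number ("starting from the default port number, 8080") can be pictured with the sketch below. It is an assumption about the general idea, not anvi'o's actual implementation; the function name, the number of attempts, and the loopback host are all hypothetical choices.

```python
import socket

def find_available_port(start=8080, attempts=100, host="127.0.0.1"):
    """Return the first port at or after `start` that can be bound on `host`."""
    last = min(start + attempts, 65536)  # never probe beyond the valid port range
    for port in range(start, last):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port is taken or not permitted; try the next one
            return port
    raise RuntimeError(f"no free port found in {start}..{last - 1}")

port = find_available_port()
print(port)
```

If you pass a port explicitly with -P, no such search is needed.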

anvi-display-functions

Start an anvi'o interactive display to see functions across genomes

Usage

usage: anvi-display-functions [-h] [-i FILE_PATH] [-e FILE_PATH]
                              [-g GENOMES_STORAGE] [-G TEXT_FILE]
                              [--print-genome-names-and-quit]
                              --annotation-source SOURCE NAME
                              [--aggregate-based-on-accession]
                              [--aggregate-using-all-hits]
                              [--min-occurrence NUM GENOMES] -p PROFILE_DB
                              [--title NAME] [--state-autoload NAME]
                              [--collection-autoload NAME]
                              [--export-svg FILE_PATH] [--dry-run]
                              [--skip-news] [-I IP_ADDR] [-P INT]
                              [--browser-path PATH] [--read-only]
                              [--server-only] [--password-protected]
                              [--user-server-shutdown]

Parameters

GENOMES: Tell anvi'o where your genomes are.

  -i FILE_PATH, --internal-genomes FILE_PATH
                        A five-column TAB-delimited flat text file. The header
                        line must contain these columns: 'name', 'bin_id',
                        'collection_id', 'profile_db_path', 'contigs_db_path'.
                        Each line should list a single entry, where 'name' can
                        be any name to describe the anvi'o bin identified as
                        'bin_id' that is stored in a collection. (default:
                        None)
  -e FILE_PATH, --external-genomes FILE_PATH
                        A two-column TAB-delimited flat text file that lists
                        anvi'o contigs databases. The first item in the header
                        line should read 'name', and the second should read
                        'contigs_db_path'. Each line in the file should
                        describe a single entry, where the first column is the
                        name of the genome (or MAG), and the second column is
                        the anvi'o contigs database generated for this genome.
                        (default: None)
  -g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
                        Anvi'o genomes storage file (default: None)
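
The two flat files described above are easy to generate programmatically. The sketch below writes an external genomes file and an internal genomes file with the documented header columns; the genome names, bin and collection names, and database paths are hypothetical placeholders.

```python
import csv
import os
import tempfile

INTERNAL_COLUMNS = ["name", "bin_id", "collection_id", "profile_db_path", "contigs_db_path"]
EXTERNAL_COLUMNS = ["name", "contigs_db_path"]

def write_genomes_file(path, columns, rows):
    """Write a TAB-delimited genomes file: one header line, then one line per genome."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns, delimiter="\t", lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)

out_dir = tempfile.mkdtemp()
external_path = os.path.join(out_dir, "external-genomes.txt")
internal_path = os.path.join(out_dir, "internal-genomes.txt")

# Hypothetical names and paths, for illustration only.
write_genomes_file(external_path, EXTERNAL_COLUMNS, [
    {"name": "Genome_A", "contigs_db_path": "Genome_A.db"},
    {"name": "Genome_B", "contigs_db_path": "Genome_B.db"},
])
write_genomes_file(internal_path, INTERNAL_COLUMNS, [
    {"name": "MAG_01", "bin_id": "Bin_1", "collection_id": "default",
     "profile_db_path": "PROFILE.db", "contigs_db_path": "CONTIGS.db"},
])

with open(external_path) as handle:
    print(handle.read())
```

The resulting files would then be passed with -e and -i, respectively.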

GROUPS: If you want, you can also tell anvi'o how to group your genomes so it can also compute functional enrichment between them.

  -G TEXT_FILE, --groups-txt TEXT_FILE
                        A tab-delimited text file specifying which group each
                        item belongs to. Depending on the context, items here
                        may be individual samples or genomes. The first column
                        must contain item names matching to those that are in
                        your input data. A different column should have the
                        header 'group' and contain the group name for each
                        item. Each item should be associated with a single
                        group. It is always a good idea to define groups using
                        single words without any fancy characters. For
                        instance, `HIGH_TEMPERATURE` or `LOW_FITNESS` are good
                        group names. `my group #1` or `IS-THIS-OK?`, are not
                        good group names. (default: None)
  --print-genome-names-and-quit
                        Sometimes, especially when you are interested in
                        creating a groups file for your genomes you gather
                        from multiple different sources, it may be difficult
                        to know every single genome name that will go into
                        your analysis. If you declare this flag, after
                        initializing everything, anvi'o will print out every
                        genome name it found and quit, so you can actually put
                        together a groups file for them. (default: False)
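
A groups file can be checked before it is handed to anvi'o. The sketch below renders a TAB-delimited file with a 'group' column and rejects group names that are not single plain words. The exact character set anvi'o accepts is not stated above, so the pattern (letters, digits, underscores) is an assumption, as are the function names and the 'item' header for the first column.

```python
import re

# Assumption: a "single word without fancy characters" means letters, digits, underscores.
VALID_GROUP = re.compile(r"^[A-Za-z0-9_]+$")

def check_groups(item_to_group):
    """Return the sorted list of group names that are not a single plain word."""
    return sorted({g for g in item_to_group.values() if not VALID_GROUP.match(g)})

def groups_txt(item_to_group, item_header="item"):
    """Render a TAB-delimited groups file; each item maps to exactly one group."""
    bad = check_groups(item_to_group)
    if bad:
        raise ValueError(f"unsuitable group names: {bad}")
    lines = [f"{item_header}\tgroup"]
    lines += [f"{item}\t{group}" for item, group in item_to_group.items()]
    return "\n".join(lines) + "\n"

# The good and bad names are the ones used as examples in the help text.
good = {"Genome_A": "HIGH_TEMPERATURE", "Genome_B": "LOW_FITNESS"}
bad = {"Genome_A": "my group #1", "Genome_B": "IS-THIS-OK?"}
print(groups_txt(good))
print(check_groups(bad))
```

Using a dictionary keyed by item name guarantees that each item is associated with a single group.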

FUNCTIONS: Tell anvi'o which functional annotation source you like above all, and other important details you like about your analysis.

  --annotation-source SOURCE NAME
                        Get functional annotations for a specific annotation
                        source. You can use the flag '--list-annotation-
                        sources' to learn about what sources are available.
                        (default: None)
  --aggregate-based-on-accession
                        This is important. When anvi'o aggregates functions
                        for functional enrichment analyses or to display them,
                        it uses by default the 'function text' as keys. This
                        is because multiple accession IDs in various databases
                        may correspond to the same function, and when you are
                        doing a functional enrichment analysis, you most
                        likely would like to avoid over-splitting of functions
                        due to this. But then how can we know if you are doing
                        something that requires things to be aggregated based
                        on accession ids for functions rather than actual
                        functions? We can't. But we have this flag here so you
                        can instruct anvi'o to listen to you and not to us.
                        (default: False)
  --aggregate-using-all-hits
                        This program will aggregate functions based on best
                        hits only, and this flag will change that behavior. In
                        some cases a gene may be annotated with multiple
                        functions. This is a decision often made at the level
                        of function annotation tool. For instance, when you
                        run `anvi-run-ncbi-cogs`, you may end up having two
                        COG annotations for a single gene because the gene hit
                        both of them with significance scores that were above
                        the default noise cutoff. While this can be useful
                        when one visualizes functions or works with an `anvi-
                        summarize` output where things should be most
                        comprehensive, having some genes annotated with
                        multiple functions and others with one function may
                        over-split them (since in this scenario a gene with
                        COGXXX and COGXXX;COGYYY would end up in different
                        bins). Thus, when working on functional enrichment
                        analyses or displaying functions anvi'o will only use
                        the best hit for any gene that has multiple hits by
                        default. But you can turn that behavior off explicitly
                        and show anvi'o who is the boss by using this flag.
                        (default: False)
  --min-occurrence NUM GENOMES
                        The minimum number of occurrences of any given
                        function across genomes. If you set a value, functions
                        that occur in fewer genomes will be excluded.
                        (default: 1)
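
The effect of these three options can be seen on a toy example. The sketch below is not anvi'o code: the data layout, the function names, and the accession and function strings are hypothetical, and joining multiple hits with ';' follows the "COGXXX;COGYYY" notation used in the help text above.

```python
from collections import defaultdict

# Toy annotations: genome -> genes, each gene a list of (accession, function text)
# hits with the best hit first. All values are made up for illustration.
annotations = {
    "genome_A": [[("COG0001", "Kinase")], [("COG0002", "Transporter")]],
    "genome_B": [[("COG0001", "Kinase"), ("COG0003", "Phosphatase")]],
    "genome_C": [[("COG0009", "Kinase")]],
}

def aggregate(annotations, by_accession=False, all_hits=False):
    """Map each function key to the set of genomes in which it occurs."""
    index = 0 if by_accession else 1  # key on accession or on function text
    occurrence = defaultdict(set)
    for genome, genes in annotations.items():
        for hits in genes:
            used = hits if all_hits else hits[:1]  # best hit only by default
            key = ";".join(hit[index] for hit in used)
            occurrence[key].add(genome)
    return dict(occurrence)

def filter_min_occurrence(occurrence, min_occurrence=1):
    """Keep functions that occur in at least `min_occurrence` genomes."""
    return {k: v for k, v in occurrence.items() if len(v) >= min_occurrence}

by_text = aggregate(annotations)
by_accession = aggregate(annotations, by_accession=True)
with_all_hits = aggregate(annotations, all_hits=True)
print(sorted(by_text), sorted(by_accession), sorted(with_all_hits))
print(sorted(filter_min_occurrence(by_text, min_occurrence=2)))
```

In this toy data, keying on function text groups all three genomes under one function, keying on accessions splits them, and using all hits moves genome_B into its own combined key.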

PROFILE DB: To store visuals state, collections, and such. It will be AUTOMATICALLY generated for you, and you can't use an existing profile for this. But then once it is generated you can use that profile with anvi-interactive. It is actually objectively very cool.

  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)

VISUALS RELATED: Parameters that give access to various adjustments regarding the interface.

  --title NAME          Title for the interface. If you are working with a
                        RUNINFO dict, the title will be determined based on
                        information stored in that file. Regardless, you can
                        override that value using this parameter. (default:
                        None)
  --state-autoload NAME
                        Automatically load previous saved state and draw tree.
                        To see a list of available states, use --show-states
                        flag. (default: None)
  --collection-autoload NAME
                        Automatically load a collection and draw tree. To see
                        a list of available collections, use --list-
                        collections flag. (default: None)
  --export-svg FILE_PATH
                        The SVG output file path. (default: None)

SWEET PARAMS OF CONVENIENCE: Parameters and flags that are not quite essential (but nice to have).

  --dry-run             Don't do anything real. Test everything, and stop
                        right before wherever the developer said 'well, this
                        is enough testing', and decided to print out results.
                        (default: False)
  --skip-news           Don't try to read news content from upstream.
                        (default: False)

SERVER CONFIGURATION: For power users.

  -I IP_ADDR, --ip-address IP_ADDR
                        IP address for the HTTP server. The default ip address
                        (0.0.0.0) should work just fine for most.
  -P INT, --port-number INT
                        Port number to use for anvi'o services. If nothing is
                        declared, anvi'o will try to find a suitable port
                        number, starting from the default port number, 8080.
                        (default: None)
  --browser-path PATH   By default, anvi'o will use your default browser to
                        launch the interactive interface. If you would like to
                        use something else than your system default, you can
                        provide a full path for an alternative browser using
                        this parameter, and hope for the best. For instance we
                        are using this parameter to call Google's experimental
                        browser, Canary, which performs better with demanding
                        visualizations. (default: None)
  --read-only           When the interactive interface is started with this
                        flag, all 'database write' operations will be
                        disabled. (default: False)
  --server-only         The default behavior is to start the local server, and
                        fire up a browser that connects to the server. If you
                        have other plans, and want to start the server without
                        calling the browser, this is the flag you need.
                        (default: False)
  --password-protected  If this flag is set, command line tool will ask you to
                        enter a password and interactive interface will be
                        only accessible after entering same password. This
                        option is recommended for shared machines like
                        clusters or shared networks where computers are not
                        isolated. (default: False)
  --user-server-shutdown
                        Allow users to shut down an anvi'server via web
                        interface. (default: False)

anvi-display-metabolism

Start the anvi'o interactive interface for viewing KEGG metabolism data

Usage

usage: anvi-display-metabolism [-h] -c CONTIGS_DB [-m]
                               [--kegg-data-dir KEGG_DATA_DIR] [-p PROFILE_DB]
                               [-C COLLECTION_NAME] [-b BIN_NAME]
                               [-B FILE_PATH]
                               [--module-completion-threshold NUM]
                               [-I IP_ADDR] [-P INT] [--browser-path PATH]
                               [--server-only] [--password-protected]

Parameters

INPUT: The minimum you must provide to this program is a contigs database, in which case anvi'o will attempt to estimate and display metabolism for all contigs in it, assuming that the contigs database represents a single genome. If the contigs database is actually a metagenome, you should use the --metagenome-mode flag to explicitly declare that.

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -m, --metagenome-mode
                        Treat a given contigs database as a metagenome rather
                        than treating it as a single genome. (default: False)
  --kegg-data-dir KEGG_DATA_DIR
                        The directory path for your KEGG setup, which will
                        include things like KOfam profiles and KEGG MODULE
                        data. Anvi'o will try to use the default path if you
                        do not specify anything. (default: None)

ADDITIONAL INPUT: If you also provide a profile database AND a collection name, anvi'o will estimate metabolism separately for each bin in your collection. You can also limit those estimates to a specific bin or set of bins in the collection using the parameters --bin-id or --bin-ids-file, respectively.

  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name. (default: None)
  -b BIN_NAME, --bin-id BIN_NAME
                        Bin name you are interested in. (default: None)
  -B FILE_PATH, --bin-ids-file FILE_PATH
                        Text file for bins (each line should be a unique bin
                        id). (default: None)
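
As a rough illustration of how these inputs fit together, the sketch below prepares the content of a bin ids file (one unique bin id per line) and assembles a command line from the flags listed above. The function names, file names, and collection name are hypothetical, and the consistency checks reflect one reading of the help text rather than anvi'o's own validation.

```python
import shlex

def bin_ids_text(bin_ids):
    """Return file content with one unique bin id per line, in first-seen order."""
    unique = list(dict.fromkeys(b.strip() for b in bin_ids if b.strip()))
    return "\n".join(unique) + "\n"

def display_metabolism_command(contigs_db, profile_db=None, collection=None, bin_ids_file=None):
    """Assemble an anvi-display-metabolism command line from the documented flags."""
    args = ["anvi-display-metabolism", "-c", contigs_db]
    if (profile_db is None) != (collection is None):
        raise ValueError("provide a profile database AND a collection name, or neither")
    if profile_db is not None:
        args += ["-p", profile_db, "-C", collection]
    if bin_ids_file is not None:
        if profile_db is None:
            raise ValueError("a bin ids file needs a profile database and a collection")
        args += ["-B", bin_ids_file]
    return shlex.join(args)

# Hypothetical names, for illustration only.
content = bin_ids_text(["Bin_1", "Bin_2", "Bin_1"])
command = display_metabolism_command("CONTIGS.db", "PROFILE.db", "default", "bins.txt")
print(content)
print(command)
```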

OUTPUT: Parameters for controlling estimation output. The output will be TAB-delimited files which by default are prefixed with 'kegg-metabolism', but you can of course change that name here.

  --module-completion-threshold NUM
                        This threshold defines the point at which we consider
                        a KEGG module to be 'complete' or 'present' in a given
                        genome or bin. It is the fraction of steps that must
                        be complete in order for the entire module to be
                        marked complete. The default is 0.75.
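
The threshold can be read as a simple fraction test. The sketch below is a simplified illustration that treats a module as a flat list of steps; real KEGG module definitions can include alternative paths, and whether anvi'o compares with "greater than or equal to" is an assumption here, as is the function name.

```python
def module_completeness(steps_complete, threshold=0.75):
    """Return (fraction of complete steps, whether the module counts as complete)."""
    if not steps_complete:
        raise ValueError("a module needs at least one step")
    fraction = sum(1 for done in steps_complete if done) / len(steps_complete)
    # Assumption: a module exactly at the threshold is considered complete.
    return fraction, fraction >= threshold

# Hypothetical four-step module with three steps found in the genome.
three_of_four = module_completeness([True, True, True, False])
two_of_four = module_completeness([True, True, False, False])
print(three_of_four, two_of_four)
```

With the default of 0.75, three complete steps out of four are enough, while two out of four are not.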

SERVER CONFIGURATION: For power users.

  -I IP_ADDR, --ip-address IP_ADDR
                        IP address for the HTTP server. The default ip address
                        (0.0.0.0) should work just fine for most.
  -P INT, --port-number INT
                        Port number to use for anvi'o services. If nothing is
                        declared, anvi'o will try to find a suitable port
                        number, starting from the default port number, 8080.
                        (default: None)
  --browser-path PATH   By default, anvi'o will use your default browser to
                        launch the interactive interface. If you would like to
                        use something else than your system default, you can
                        provide a full path for an alternative browser using
                        this parameter, and hope for the best. For instance we
                        are using this parameter to call Google's experimental
                        browser, Canary, which performs better with demanding
                        visualizations. (default: None)
  --server-only         The default behavior is to start the local server, and
                        fire up a browser that connects to the server. If you
                        have other plans, and want to start the server without
                        calling the browser, this is the flag you need.
                        (default: False)
  --password-protected  If this flag is set, command line tool will ask you to
                        enter a password and interactive interface will be
                        only accessible after entering same password. This
                        option is recommended for shared machines like
                        clusters or shared networks where computers are not
                        isolated. (default: False)

anvi-display-pan

Start an anvi'o server to display a pan-genome

Example uses and other resources

Usage

usage: anvi-display-pan [-h] [-p PAN_DB] [-g GENOMES_STORAGE] [-d VIEW_DATA]
                        [-t NEWICK] [-V ADDITIONAL_VIEW]
                        [-A ADDITIONAL_LAYERS] [--view NAME] [--title NAME]
                        [--state-autoload NAME] [--collection-autoload NAME]
                        [--export-svg FILE_PATH] [--skip-init-functions]
                        [--dry-run] [--skip-auto-ordering] [--skip-news]
                        [-I IP_ADDR] [-P INT] [--browser-path PATH]
                        [--read-only] [--server-only] [--password-protected]
                        [--user-server-shutdown]

Parameters

INPUT FILES: Input files from the pangenome analysis.

  -p PAN_DB, --pan-db PAN_DB
                        Anvi'o pan database (default: None)
  -g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
                        Anvi'o genomes storage file (default: None)

OPTIONAL INPUTS: Where the yay factor becomes a reality.

  -d VIEW_DATA, --view-data VIEW_DATA
                        A TAB-delimited file for view data (default: None)
  -t NEWICK, --tree NEWICK
                        NEWICK formatted tree structure (default: None)

ADDITIONAL STUFF: Parameters to provide additional layers, views, or layer data.

  -V ADDITIONAL_VIEW, --additional-view ADDITIONAL_VIEW
                        A TAB-delimited file for an additional view to be used
                        in the interface. This file should contain all split
                        names, and values for each of them in all samples.
                        Each column in this file must correspond to a sample
                        name. Content of this file will be called 'user_view',
                        which will be available as a new item in the 'views'
                        combo box in the interface (default: None)
  -A ADDITIONAL_LAYERS, --additional-layers ADDITIONAL_LAYERS
                        A TAB-delimited file for additional layers for splits.
                        The first column of this file must be split names, and
                        the remaining columns should be unique attributes. The
                        file does not need to contain all split names, or
                        values for each split in every column. Anvi'o will try
                        to deal with missing data nicely. Each column in this
                        file will be visualized as a new layer in the tree.
                        (default: None)
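
An additional layers file of the kind described above can be produced as follows. This is a minimal sketch: the item names, attribute names, and the 'item_name' header are hypothetical, and writing a missing value as an empty cell is an assumption about what anvi'o will tolerate.

```python
import csv
import io

def additional_layers_text(layers, attributes, item_header="item_name"):
    """Render a TAB-delimited layers table; missing values become empty cells."""
    if len(set(attributes)) != len(attributes):
        raise ValueError("attribute columns must be unique")
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow([item_header, *attributes])
    for item, values in layers.items():
        writer.writerow([item, *(values.get(a, "") for a in attributes)])
    return buffer.getvalue()

# Hypothetical items; the second one has no value for 'is_core'.
layers = {
    "GC_00000001": {"is_core": "yes", "num_genes": 12},
    "GC_00000002": {"num_genes": 3},
}
table = additional_layers_text(layers, ["is_core", "num_genes"])
print(table)
```

Items that are left out of the dictionary entirely are simply absent from the file, which the help text says is allowed.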

VISUALS RELATED: Parameters that give access to various adjustments regarding the interface.

  --view NAME           Start the interface with a pre-selected view. To see a
                        list of available views, use --show-views flag.
                        (default: None)
  --title NAME          Title for the interface. If you are working with a
                        RUNINFO dict, the title will be determined based on
                        information stored in that file. Regardless, you can
                        override that value using this parameter. (default:
                        None)
  --state-autoload NAME
                        Automatically load previous saved state and draw tree.
                        To see a list of available states, use --show-states
                        flag. (default: None)
  --collection-autoload NAME
                        Automatically load a collection and draw tree. To see
                        a list of available collections, use --list-
                        collections flag. (default: None)
  --export-svg FILE_PATH
                        The SVG output file path. (default: None)

SWEET PARAMS OF CONVENIENCE: Parameters and flags that are not quite essential (but nice to have).

  --skip-init-functions
                        When declared, function calls for genes will not be
                        initialized (therefore will be missing from all
                        relevant interfaces or output files). The use of this
                        flag may reduce the memory fingerprint and processing
                        time for large datasets. (default: False)
  --dry-run             Don't do anything real. Test everything, and stop
                        right before wherever the developer said 'well, this
                        is enough testing', and decided to print out results.
                        (default: False)
  --skip-auto-ordering  When declared, the attempt to include automatically
                        generated orders of items based on additional data is
                        skipped. In case those buggers cause issues with your
                        data, and you still want to see your stuff and deal
                        with the other issue maybe later. (default: False)
  --skip-news           Don't try to read news content from upstream.
                        (default: False)

SERVER CONFIGURATION: For power users.

  -I IP_ADDR, --ip-address IP_ADDR
                        IP address for the HTTP server. The default ip address
                        (0.0.0.0) should work just fine for most.
  -P INT, --port-number INT
                        Port number to use for anvi'o services. If nothing is
                        declared, anvi'o will try to find a suitable port
                        number, starting from the default port number, 8080.
                        (default: None)
  --browser-path PATH   By default, anvi'o will use your default browser to
                        launch the interactive interface. If you would like to
                        use something else than your system default, you can
                        provide a full path for an alternative browser using
                        this parameter, and hope for the best. For instance we
                        are using this parameter to call Google's experimental
                        browser, Canary, which performs better with demanding
                        visualizations. (default: None)
  --read-only           When the interactive interface is started with this
                        flag, all 'database write' operations will be
                        disabled. (default: False)
  --server-only         The default behavior is to start the local server, and
                        fire up a browser that connects to the server. If you
                        have other plans, and want to start the server without
                        calling the browser, this is the flag you need.
                        (default: False)
  --password-protected  If this flag is set, command line tool will ask you to
                        enter a password and interactive interface will be
                        only accessible after entering same password. This
                        option is recommended for shared machines like
                        clusters or shared networks where computers are not
                        isolated. (default: False)
  --user-server-shutdown
                        Allow users to shut down an anvi'server via web
                        interface. (default: False)

anvi-display-structure

Interactively visualize sequence variants on protein structures

Example uses and other resources

Usage

usage: anvi-display-structure [-h] -s STRUCTURE_DB [-p PROFILE_DB]
                              [-c CONTIGS_DB] [-V VARIABILITY_TABLE]
                              [--splits-of-interest FILE]
                              [--samples-of-interest FILE]
                              [--genes-of-interest FILE]
                              [--gene-caller-ids GENE_CALLER_IDS] [-j FLOAT]
                              [--SAAVs-only] [--SCVs-only] [-I IP_ADDR]
                              [-P INT] [--browser-path PATH] [--server-only]
                              [--password-protected]

Parameters

STRUCTURE: Information related to the structure database, which can be created with anvi-gen-structure-database.

  -s STRUCTURE_DB, --structure-db STRUCTURE_DB
                        Anvi'o structure database. (default: None)

VARIABILITY: We can overlay codon and amino acid variability in your metagenomes but we need a data source of this variability. Most simply, anvi'o can learn this information when you provide both your profile (-p) and contigs (-c) databases. Alternatively, you can provide a variability table output (-V) from the program anvi-gen-variability-profile. If you don't want to visualize variants, this is the wrong tool for the job. Instead, export the PDB files with anvi-export-structures, and open with a more comprehensive protein viewing software.

  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -V VARIABILITY_TABLE, --variability-profile VARIABILITY_TABLE
                        The output of anvi-gen-variability-profile, or a
                        different variant-calling output that has been
                        converted to the anvi'o format. (default: None)
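
The two ways of supplying variability data can be sketched as a small command builder. It uses only the flags listed above; the function name and file names are hypothetical, and preferring the variability table when both sources are given is a choice made for this sketch, not documented anvi'o behavior.

```python
import shlex

def display_structure_command(structure_db, profile_db=None, contigs_db=None,
                              variability_table=None):
    """Assemble an anvi-display-structure command line with one variability source."""
    args = ["anvi-display-structure", "-s", structure_db]
    if variability_table is not None:
        args += ["-V", variability_table]  # precomputed variability profile
    elif profile_db is not None and contigs_db is not None:
        args += ["-p", profile_db, "-c", contigs_db]  # let anvi'o learn it from the databases
    else:
        raise ValueError("give either a variability table, or both profile and contigs databases")
    return shlex.join(args)

# Hypothetical file names, for illustration only.
from_databases = display_structure_command("STRUCTURE.db", "PROFILE.db", "CONTIGS.db")
from_table = display_structure_command("STRUCTURE.db", variability_table="variability.txt")
print(from_databases)
print(from_table)
```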

REFINING PARAMETERS: Which samples, genes, and contigs etc. are you interested in? Define that stuff here.

  --splits-of-interest FILE
                        A file with split names. There should be only one
                        column in the file, and each line should correspond to
                        a unique split name. (default: None)
  --samples-of-interest FILE
                        A file with sample names. There should be only one
                        column in the file, and each line should correspond to
                        a unique sample name (without a column header).
                        (default: None)
  --genes-of-interest FILE
                        A file with anvi'o gene caller IDs. There should be
                        only one column in the file, and each line should
                        correspond to a unique gene caller id (without a
                        column header). (default: None)
  --gene-caller-ids GENE_CALLER_IDS
                        Gene caller ids. Multiple of them can be declared
                        separated by a delimiter (the default is a comma). In
                        anvi-gen-variability-profile, if you declare nothing
                        you will get all genes matching your other filtering
                        criteria. In other programs, you may get everything,
                        nothing, or an error. It really depends on the
                        situation. Fortunately, mistakes are cheap, so it's
                        worth a try. (default: None)
  -j FLOAT, --min-departure-from-consensus FLOAT
                        Takes a value between 0 and 1, where 1 is maximum
                        divergence from the consensus. It can be an expensive
                        operation to display every variable position, and so
                        the default is 0.05. To display every variable
                        position, set this parameter to 0. (default: 0.05)
  --SAAVs-only          If provided, variability will be generated for single
                        amino acid variants (SAAVs) and not for single codon
                        variants (SCVs). This could save you some time if
                        you're only interested in SAAVs. (default: False)
  --SCVs-only           If provided, variability will be generated for single
                        codon variants (SCVs) and not for single amino acid
                        variants (SAAVs). This could save you some time if
                        you're only interested in SCVs. (default: False)
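
Two of the refining parameters are easy to picture with toy data: a delimited list of gene caller ids, and the minimum departure from consensus. The sketch below is illustrative only. The function names and the toy rows are hypothetical, treating gene caller ids as integers is an assumption, and so is keeping positions that are exactly at the threshold.

```python
def parse_gene_caller_ids(text, delimiter=","):
    """Split a delimited string such as '1,5,12' into a list of integer ids."""
    return [int(part) for part in text.split(delimiter) if part.strip()]

def filter_by_departure(positions, min_departure=0.05):
    """Keep variable positions whose departure from consensus reaches the threshold."""
    return [p for p in positions if p["departure_from_consensus"] >= min_departure]

# Toy variable positions; the values are made up for illustration.
positions = [
    {"gene_caller_id": 1, "departure_from_consensus": 0.01},
    {"gene_caller_id": 5, "departure_from_consensus": 0.20},
    {"gene_caller_id": 12, "departure_from_consensus": 0.50},
]
gene_ids = parse_gene_caller_ids("1, 5, 12")
default_kept = filter_by_departure(positions)
all_kept = filter_by_departure(positions, min_departure=0)
print(gene_ids, len(default_kept), len(all_kept))
```

Setting the threshold to 0 keeps every variable position, which matches the description of -j above.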

SERVER CONFIGURATION: For power users.

  -I IP_ADDR, --ip-address IP_ADDR
                        IP address for the HTTP server. The default ip address
                        (0.0.0.0) should work just fine for most.
  -P INT, --port-number INT
                        Port number to use for anvi'o services. If nothing is
                        declared, anvi'o will try to find a suitable port
                        number, starting from the default port number, 8080.
                        (default: None)
  --browser-path PATH   By default, anvi'o will use your default browser to
                        launch the interactive interface. If you would like to
                        use something other than your system default, you can
                        provide a full path for an alternative browser using
                        this parameter, and hope for the best. For instance we
                        are using this parameter to call Google's experimental
                        browser, Canary, which performs better with demanding
                        visualizations. (default: None)
  --server-only         The default behavior is to start the local server, and
                        fire up a browser that connects to the server. If you
                        have other plans, and want to start the server without
                        calling the browser, this is the flag you need.
                        (default: False)
  --password-protected  If this flag is set, the command line tool will ask
                        you for a password, and the interactive interface will
                        be accessible only with that password. This
                        option is recommended for shared machines like
                        clusters or shared networks where computers are not
                        isolated. (default: False)
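
The help text for --port-number says that anvi'o looks for a suitable port starting from 8080. The sketch below shows one way such a search could work; it is not anvi'o's implementation. The function names, the `max_tries` limit, and the bind-based availability check are all assumptions.

```python
import socket


def port_is_free(port, ip="0.0.0.0"):
    """Return True if we can bind to (ip, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((ip, port))
        except OSError:
            return False
        return True


def find_port(start=8080, max_tries=100, is_free=port_is_free):
    """Return the first free port at or after `start`.

    `is_free` can be swapped out, which makes the search easy to test
    without touching the network.
    """
    for port in range(start, start + max_tries):
        if is_free(port):
            return port
    raise RuntimeError(f"no free port in {start}..{start + max_tries - 1}")
```

Calling `find_port()` with no arguments probes real ports on the machine; passing your own `is_free` function lets you try the logic without opening sockets.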

anvi-estimate-genome-completeness

Estimate completion and redundancy using domain-specific single-copy core genes

Usage

usage: anvi-estimate-genome-completeness [-h] [-c CONTIGS_DB] [-e FILE_PATH]
                                         [-i FILE_PATH] [-p PROFILE_DB]
                                         [-C COLLECTION_NAME]
                                         [--list-collections] [--just-do-it]
                                         [--concise] [-o FILE_PATH]

Parameters

MANDATORY INPUT OPTION #1: Minimum input is an anvi'o contigs database. If you provide nothing else, anvi'o will assume that it is a single genome (even if it is not), and give you back what you need.

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by
                        'anvi-gen-contigs-database' (default: None)

MANDATORY INPUT OPTION #2: Or you can initiate this with an external genomes file.

  -e FILE_PATH, --external-genomes FILE_PATH
                        A two-column TAB-delimited flat text file that lists
                        anvi'o contigs databases. The first item in the header
                        line should read 'name', and the second should read
                        'contigs_db_path'. Each line in the file should
                        describe a single entry, where the first column is the
                        name of the genome (or MAG), and the second column is
                        the anvi'o contigs database generated for this genome.
                        (default: None)
  -i FILE_PATH, --internal-genomes FILE_PATH
                        A five-column TAB-delimited flat text file. The header
                        line must contain these columns: 'name', 'bin_id',
                        'collection_id', 'profile_db_path', 'contigs_db_path'.
                        Each line should list a single entry, where 'name' can
                        be any name to describe the anvi'o bin identified as
                        'bin_id' that is stored in a collection. (default:
                        None)
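
The two file layouts described above are easy to get wrong by hand, so here is a small sketch that builds them as text. The column names come from the help text; the function names, genome names, and paths are placeholders invented for this example.

```python
INTERNAL_COLUMNS = ("name", "bin_id", "collection_id",
                    "profile_db_path", "contigs_db_path")


def external_genomes_tsv(genomes):
    """Return the text of an external genomes file.

    `genomes` maps a genome (or MAG) name to its contigs database path.
    """
    lines = ["name\tcontigs_db_path"]
    for name, contigs_db_path in genomes.items():
        lines.append(f"{name}\t{contigs_db_path}")
    return "\n".join(lines) + "\n"


def internal_genomes_tsv(entries):
    """Return the text of an internal genomes file.

    `entries` is a list of dicts, each with all five INTERNAL_COLUMNS keys.
    """
    lines = ["\t".join(INTERNAL_COLUMNS)]
    for entry in entries:
        lines.append("\t".join(entry[column] for column in INTERNAL_COLUMNS))
    return "\n".join(lines) + "\n"


# Placeholder names and paths
external = external_genomes_tsv({"genome_A": "genome_A.db",
                                 "genome_B": "genome_B.db"})
```

You could then save the text with `pathlib.Path("external-genomes.txt").write_text(external)` and pass that file to -e.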

ADDITIONAL INPUT (OPTIONAL): You can also give this program an anvi'o profile database along with a collection name. In which case anvi'o will estimate the completion and redundancy of every bin in this collection. Fun.

  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name. (default: None)

PARAMETERS OF CONVENIENCE: Because life is already very hard as it is.

  --list-collections    Show available collections and exit. (default: False)
  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)
  --concise             Don't be verbose, print fewer messages whenever
                        possible. (default: False)
  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)

anvi-estimate-metabolism

Reconstructs metabolic pathways and estimates pathway completeness for a given set of contigs

Usage

usage: anvi-estimate-metabolism [-h] [-c CONTIGS_DB] [-m]
                                [--kegg-data-dir KEGG_DATA_DIR]
                                [-p PROFILE_DB] [-C COLLECTION_NAME]
                                [-b BIN_NAME] [-B FILE_PATH] [-e FILE_PATH]
                                [-i FILE_PATH] [-M FILE_PATH]
                                [--module-completion-threshold NUM]
                                [-O FILENAME_PREFIX] [--include-zeros]
                                [--only-complete] [--kegg-output-modes MODES]
                                [--list-available-modes]
                                [--custom-output-headers HEADERS]
                                [--list-available-output-headers]
                                [--add-coverage] [--matrix-format]
                                [--include-metadata]
                                [--module-specific-matrices MODULE_LIST]
                                [--no-comments]
                                [--get-raw-data-as-json FILENAME_PREFIX]
                                [--store-json-without-estimation]
                                [--estimate-from-json FILE_PATH]

Parameters

INPUT #1 - ESTIMATION ON SINGLE GENOMES OR METAGENOMES: The minimum you must provide to this program is a contigs database, in which case anvi'o will attempt to estimate metabolism for all contigs in it, assuming that the contigs database represents a single genome. If the contigs database is actually a metagenome, you should use the --metagenome-mode flag to explicitly declare that.

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by
                        'anvi-gen-contigs-database' (default: None)
  -m, --metagenome-mode
                        Treat a given contigs database as a metagenome rather
                        than treating it as a single genome. (default: False)
  --kegg-data-dir KEGG_DATA_DIR
                        The directory path for your KEGG setup, which will
                        include things like KOfam profiles and KEGG MODULE
                        data. Anvi'o will try to use the default path if you
                        do not specify anything. (default: None)

INPUT #2 - ESTIMATION ON BINS: If you also provide a profile database AND a collection name, anvi'o will estimate metabolism separately for each bin in your collection. You can also limit those estimates to a specific bin or set of bins in the collection using the parameters --bin-id or --bin-ids-file, respectively.

  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name. (default: None)
  -b BIN_NAME, --bin-id BIN_NAME
                        Bin name you are interested in. (default: None)
  -B FILE_PATH, --bin-ids-file FILE_PATH
                        Text file for bins (each line should be a unique bin
                        id). (default: None)

INPUT #3 - MULTI-MODE: If you have multiple contigs databases to work with, you can put them all into a file. Then anvi'o will run estimation separately on each database and generate a single output file for all. There are 3 types of input files to choose from depending on whether you have single genomes (external), genomes in collections (internal), or metagenomes in your contigs DBs.

  -e FILE_PATH, --external-genomes FILE_PATH
                        A two-column TAB-delimited flat text file that lists
                        anvi'o contigs databases. The first item in the header
                        line should read 'name', and the second should read
                        'contigs_db_path'. Each line in the file should
                        describe a single entry, where the first column is the
                        name of the genome (or MAG), and the second column is
                        the anvi'o contigs database generated for this genome.
                        (default: None)
  -i FILE_PATH, --internal-genomes FILE_PATH
                        A five-column TAB-delimited flat text file. The header
                        line must contain these columns: 'name', 'bin_id',
                        'collection_id', 'profile_db_path', 'contigs_db_path'.
                        Each line should list a single entry, where 'name' can
                        be any name to describe the anvi'o bin identified as
                        'bin_id' that is stored in a collection. (default:
                        None)
  -M FILE_PATH, --metagenomes FILE_PATH
                        A TAB-delimited flat text file. The header
                        line must contain these columns: 'name',
                        'contigs_db_path', and 'profile_db_path'. Each line
                        should list a single entry, where 'name' can be any
                        name to describe the metagenome stored in the anvi'o
                        contigs database. In this context, the anvi'o profiles
                        associated with contigs database must be SINGLE
                        PROFILES, as in generated by the program
                        `anvi-profile` and not `anvi-merge`. (default: None)
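
Before handing a metagenomes file to -M, it can be worth checking its header line. The sketch below does that, treating the three column names listed in the help text as required. Whether anvi'o itself insists on all three is an assumption here, and the function name is invented for this example.

```python
REQUIRED_COLUMNS = ("name", "contigs_db_path", "profile_db_path")


def missing_columns(tsv_text, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the header line."""
    lines = tsv_text.splitlines()
    header = lines[0].split("\t") if lines else []
    return [column for column in required if column not in header]


# Placeholder file contents
complete = "name\tcontigs_db_path\tprofile_db_path\nsample_1\tC1.db\tP1.db\n"
incomplete = "name\tcontigs_db_path\nsample_1\tC1.db\n"
```

An empty list from `missing_columns(complete)` means the header looks fine, while `missing_columns(incomplete)` names the column that still needs to be added.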

OUTPUT - GENERAL OPTIONS: Parameters for controlling estimation output of any type. The output will be TAB-delimited files which by default are prefixed with 'kegg-metabolism', but you can of course change that name here.

  --module-completion-threshold NUM
                        This threshold defines the point at which we consider
                        a KEGG module to be 'complete' or 'present' in a given
                        genome or bin. It is the fraction of steps that must
                        be complete in order for the entire module to be
                        marked complete. The default is 0.75.
  -O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
                        A prefix to be used while naming the output files (no
                        file type extensions please; just a prefix). (default:
                        None)
  --include-zeros       If you use this flag, output files will include
                        modules with 0 percent completeness score, and in the
                        case of --matrix-format, output matrices will include
                        rows with 0s in every sample. (default: False)
  --only-complete       Choose this flag if you want only modules over the
                        module completeness threshold to be included in any
                        output files. (default: False)
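
Here is a small worked sketch of how the completion threshold, --only-complete, and --include-zeros interact. It is an illustration rather than anvi'o's code: the function names, the placeholder module names, and the choice of an inclusive comparison (a score equal to the threshold counts as complete) are assumptions.

```python
def module_completeness(steps_complete, threshold=0.75):
    """Return (fraction of complete steps, whether the module is 'complete').

    `steps_complete` holds one boolean per step of the module.
    """
    fraction = sum(steps_complete) / len(steps_complete)
    return fraction, fraction >= threshold


def select_modules(completeness, threshold=0.75,
                   only_complete=False, include_zeros=False):
    """Pick which modules would be reported under the given flags."""
    selected = {}
    for module, score in completeness.items():
        if score == 0 and not include_zeros:
            continue  # zero-completeness modules are hidden by default
        if only_complete and score < threshold:
            continue  # keep only modules that reach the threshold
        selected[module] = score
    return selected


# Placeholder module names and scores
scores = {"module_a": 1.0, "module_b": 0.5, "module_c": 0.0}
```

For a module with four steps of which three are complete, the fraction is 3/4 = 0.75, which meets the default threshold under the inclusive assumption made here.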

OUTPUT - LONG-FORMAT OPTIONS: Parameters for controlling long-format output (the default).

  --kegg-output-modes MODES
                        Use this flag to indicate what information you want in
                        the kegg metabolism output files, by providing a
                        comma-separated list of output modes (each 'mode' you
                        provide will result in a different output file, all
                        with the same prefix). The default output modes are
                        'kofam_hits' and 'complete_modules'. To see a list of
                        available output modes, run this script with the flag
                        --list-available-modes. (default: None)
  --list-available-modes
                        Use this flag to see the available output modes and
                        their descriptions. (default: False)
  --custom-output-headers HEADERS
                        For use with the 'custom' output mode. Provide a
                        comma-separated list of headers to include in the
                        output matrix. To see a list of available headers, run
                        this script with the flag
                        --list-available-output-headers. (default: None)
  --list-available-output-headers
                        Use this flag to see the available output headers.
                        (default: False)
  --add-coverage        Use this flag to request that coverage and detection
                        values be added as columns in long-format output
                        files. You must provide the profile database
                        corresponding to your contigs db for this to work.
                        (default: False)
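
Since --kegg-output-modes takes a comma-separated list and --add-coverage needs a profile database, a small helper can assemble the command line consistently. This is a sketch only: the helper and the file names are invented for illustration, while the flags and the two mode names are the ones shown in the help text above.

```python
def metabolism_command(contigs_db, modes=None, prefix=None,
                       add_coverage=False, profile_db=None):
    """Build an argument list for anvi-estimate-metabolism."""
    cmd = ["anvi-estimate-metabolism", "-c", contigs_db]
    if modes:
        # Each mode yields its own output file, all sharing one prefix
        cmd += ["--kegg-output-modes", ",".join(modes)]
    if prefix:
        cmd += ["-O", prefix]
    if add_coverage:
        if profile_db is None:
            raise ValueError("--add-coverage requires a profile database")
        cmd += ["-p", profile_db, "--add-coverage"]
    return cmd


# Placeholder file names
cmd = metabolism_command("CONTIGS.db",
                         modes=["kofam_hits", "complete_modules"],
                         prefix="my_run")
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where anvi'o is installed.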

OUTPUT - MATRIX OPTIONS: Parameters for controlling matrix output. Use --matrix-format to request this type of output.

  --matrix-format       If you want to generate the output in several sparse
                        matrices instead of one file, use this flag. In each
                        matrix, contigs DBs will be arranged in columns and
                        KEGG modules in rows. This output option is especially
                        appropriate for input option #3. (default: False)
  --include-metadata    When asking for --matrix-format, you can use this flag
                        to make sure the output matrix files include columns
                        with metadata for each KEGG Module or KO (like the
                        module name and category for example) before the
                        sample columns. (default: False)
  --module-specific-matrices MODULE_LIST
                        Provide a comma-separated list of module numbers to
                        this parameter, and then you will get a KO hits matrix
                        for each module in the list. (default: None)
  --no-comments         If you are requesting --module-specific-matrices but
                        you don't want those matrices to include comment lines
                        in them (for example, perhaps you want to use them for
                        clustering), you can use this flag. Otherwise, by
                        default these specific matrices will include comments
                        delineating which KOs are in each step of the module.
                        (default: False)
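
To picture the layout that --matrix-format describes (contigs DBs in columns, KEGG modules in rows), here is a minimal sketch that pivots long-format records into such a matrix. The record shape, the names, and the choice to fill missing values with 0 are assumptions for illustration, not anvi'o's output code.

```python
def long_to_matrix(records, include_zeros=False):
    """Pivot (db_name, module, value) records into a list of rows.

    The first row is the header; each later row is one module.
    Rows that are 0 in every database are dropped unless
    `include_zeros` is set.
    """
    dbs = sorted({db for db, _, _ in records})
    modules = sorted({module for _, module, _ in records})
    values = {(db, module): value for db, module, value in records}

    rows = [["module"] + dbs]
    for module in modules:
        row = [values.get((db, module), 0) for db in dbs]
        if any(row) or include_zeros:
            rows.append([module] + row)
    return rows


# Placeholder records
records = [
    ("genome_1", "module_a", 1.0),
    ("genome_2", "module_a", 0.5),
    ("genome_1", "module_b", 0.0),
]
matrix = long_to_matrix(records)
```

Here `module_b` is 0 in every database, so it only shows up when `include_zeros=True`, in the same spirit as the --include-zeros flag.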

DEBUG: Parameters to use if you think something fishy is going on or otherwise want to exert more control. Go for it.

  --get-raw-data-as-json FILENAME_PREFIX
                        If you want the raw metabolism estimation data
                        dictionary in JSON-format, provide a filename prefix
                        to this argument. The program will then output a file
                        with the .json extension containing this data.
                        (default: None)
  --store-json-without-estimation
                        This flag is used to control what is stored in the
                        JSON-formatted metabolism data dictionary. When this
                        flag is provided alongside the --get-raw-data-as-json
                        flag, the JSON file will be created without running
                        metabolism estimation, and that file will consequently
                        include only information about KOfam hits and gene
                        calls. The idea is that you can then modify this file
                        as you like and re-run this program using the flag
                        --estimate-from-json. (default: False)
  --estimate-from-json FILE_PATH
                        If you have a JSON file containing KOfam hits and gene
                        call information from your contigs database (such as a
                        file produced using the --get-raw-data-as-json flag),
                        you can provide that file to this flag and KEGG
                        metabolism estimates will be computed from the
                        information within instead of from a contigs database.
                        (default: None)
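
The three DEBUG flags are meant to be used as a two-step round trip: dump the raw data, edit it, then estimate from the edited file. A sketch of the two commands follows. The helper and file names are invented, and the assumption that the dump is named exactly `<prefix>.json` rests only on the help text saying the file gets a .json extension.

```python
def json_roundtrip_commands(contigs_db, prefix):
    """Return the (dump, estimate) argument lists for the JSON round trip."""
    # Step 1: store KOfam hits and gene calls without running estimation
    dump = ["anvi-estimate-metabolism", "-c", contigs_db,
            "--get-raw-data-as-json", prefix,
            "--store-json-without-estimation"]
    # Step 2: after editing the JSON file, estimate from it directly
    # (the '<prefix>.json' file name is an assumption)
    estimate = ["anvi-estimate-metabolism",
                "--estimate-from-json", prefix + ".json"]
    return dump, estimate


# Placeholder file names
dump_cmd, estimate_cmd = json_roundtrip_commands("CONTIGS.db", "raw_data")
```

Between the two steps you would open the JSON file and modify it as you like, as the help text suggests.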

anvi-estimate-scg-taxonomy

Estimates taxonomy at genome and metagenome level. This program is the entry point to estimate taxonomy for a given set of contigs (i.e., all contigs in a contigs database, or contigs described in collections as bins). For this, it uses single-copy core gene sequences and the GTDB database.

Example uses and other resources

Usage

usage: anvi-estimate-scg-taxonomy [-h] [-c CONTIGS_DB] [-m] [-p PROFILE_DB]
                                  [-C COLLECTION_NAME] [-M FILE_PATH]
                                  [-o FILE_PATH]
                                  [--per-scg-output-file FILE_PATH]
                                  [-O FILENAME_PREFIX]
                                  [--taxonomic-level {t_domain,t_phylum,t_class,t_order,t_family,t_genus,t_species}]
                                  [--matrix-format] [--raw-output]
                                  [-T NUM_THREADS] [-S SCG_NAME]
                                  [--report-scg-frequencies FILE_PATH]
                                  [--just-do-it]
                                  [--simplify-taxonomy-information]
                                  [--compute-scg-coverages]
                                  [--update-profile-db-with-taxonomy]
                                  [-r PATH]

Parameters

INPUT #1: The minimum you must provide to this program is a contigs database, in which case anvi'o will attempt to estimate taxonomy for all contigs in it, assuming that the contigs database represents a single genome. If the contigs database is actually a metagenome, you should use the --metagenome-mode flag to explicitly declare that.

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by
                        'anvi-gen-contigs-database' (default: None)
  -m, --metagenome-mode
                        Treat a given contigs database as a metagenome rather
                        than treating it as a single genome. (default: False)

INPUT #2: In addition, you can also point out a profile database. In which case you also must provide a collection name. When you do that anvi'o will offer taxonomy estimates for each bin in your collection.

  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name. (default: None)

INPUT #3: You can also work with a metagenomes file, assuming that you have multiple metagenomes with or without associated mapping results, and anvi'o will generate a single output file for all.

  -M FILE_PATH, --metagenomes FILE_PATH
                        A TAB-delimited flat text file. The header
                        line must contain these columns: 'name',
                        'contigs_db_path', and 'profile_db_path'. Each line
                        should list a single entry, where 'name' can be any
                        name to describe the metagenome stored in the anvi'o
                        contigs database. In this context, the anvi'o profiles
                        associated with contigs database must be SINGLE
                        PROFILES, as in generated by the program
                        `anvi-profile` and not `anvi-merge`. (default: None)

OUTPUT AND FORMATTING: Anvi'o will do its best to offer you some fancy output tables for your viewing pleasure by default. But in addition to that, you can ask the resulting information to be stored in a TAB-delimited file (which is a much better way to include the results in your study as supplementary information, or work with these results using other analysis tools such as R). Depending on the mode you are running this program, anvi'o may ask you to use an 'output file prefix' rather than an 'output file path'.

  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
  --per-scg-output-file FILE_PATH
                        A more detailed output file that will describe
                        taxonomy of each scg in a single bin. When consensus
                        taxonomy is generated per bin or genome, taxonomy for
                        each underlying item is not reported. This additional
                        optional output file will elucidate things. (default:
                        None)
  -O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
                        A prefix to be used while naming the output files (no
                        file type extensions please; just a prefix). (default:
                        None)
  --taxonomic-level {t_domain,t_phylum,t_class,t_order,t_family,t_genus,t_species}
                        The taxonomic level to use whenever relevant and/or
                        available. The default taxonomic level is None, but if
                        you choose something specific, anvi'o will focus on
                        that whenever possible.
  --matrix-format       If you want the reports to look like sparse matrices
                        whenever possible, declare this flag. Matrices are
                        especially good to use when you are working with
                        internal/external genomes since they can show you
                        quickly the distribution of each taxon across all
                        metagenomes in programs like EXCEL. WELL TRY IT AND
                        SEE. (default: False)
  --raw-output          Just store the raw output without any processing of
                        the primary data structure. (default: False)

PERFORMANCE: We are not sure if allocating more threads for this operation will change anything. But hey. One can try.

  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: 1)

AUTHORITY: Assert your dominance.

  -S SCG_NAME, --scg-name-for-metagenome-mode SCG_NAME
                        When running in metagenome mode, anvi'o automatically
                        chooses the most frequent single-copy core gene to
                        estimate the taxonomic composition within a contigs
                        database. If you have a different preference you can
                        use this parameter to communicate that. (default:
                        None)
  --report-scg-frequencies FILE_PATH
                        Report SCG frequencies in a TAB-delimited file and
                        quit. This is a great way to decide which SCG name to
                        use in metagenome mode (we often wish to use the most
                        frequent SCG to increase the detection of taxa).
                        (default: None)
  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)
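
The help text above suggests using the most frequent SCG in metagenome mode. As a sketch of that choice, the function below picks the most frequent name from a dictionary of counts. The counts and gene names are made up, the layout of the file written by --report-scg-frequencies is not assumed here, and breaking ties alphabetically is simply a choice made for this example.

```python
def most_frequent(frequencies):
    """Return the name with the highest count (ties: alphabetical order)."""
    if not frequencies:
        raise ValueError("no frequencies given")
    # sorted() fixes the order, and max() returns the first maximal item
    return max(sorted(frequencies), key=frequencies.get)


# Made-up counts for illustration
scg_counts = {"Ribosomal_S2": 12, "Ribosomal_L1": 15, "Ribosomal_S9": 15}
best_scg = most_frequent(scg_counts)
```

The chosen name is what you would pass to -S if you wanted to set the SCG yourself rather than let anvi'o pick it.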

ADVANCED: Very pro-like stuff.

  --simplify-taxonomy-information
                        The taxonomy output may include a large number of
                        names that contain clade-specific code for not-yet-
                        characterized taxa. With this flag you can simplify
                        taxon names. This will influence all output files and
                        displays as the use of this flag will on-the-fly trim
                        taxonomic levels with clade-specific code names.
                        (default: False)
  --compute-scg-coverages
                        When this flag is declared, anvi'o will go back to the
                        profile database to learn coverage statistics of
                        single-copy core genes for which we have taxonomy
                        information. (default: False)
  --update-profile-db-with-taxonomy
                        When anvi'o knows both taxonomic affiliations and
                        coverages across samples for single-copy core genes,
                        it can, in theory, add this information to the profile
                        database. With this flag you can instruct anvi'o to do
                        that and find information on taxonomy in the `layers`
                        tab of your interactive interface. (default: False)

BORING: Options that you will likely never need.

  -r PATH, --taxonomy-database PATH
                        Path to the directory that contains the BLAST
                        databases for single-copy core genes. You will almost
                        never need to use this parameter unless you are trying
                        something very fancy. But when you do, you can tell
                        anvi'o where to look for database files through this
                        parameter. (default: None)

anvi-estimate-trna-taxonomy

Estimates taxonomy at genome and metagenome level using tRNA sequences.

Usage

usage: anvi-estimate-trna-taxonomy [-h] [-c CONTIGS_DB] [-m] [-p PROFILE_DB]
                                   [-C COLLECTION_NAME] [-M FILE_PATH]
                                   [--dna-sequence DNA SEQ]
                                   [--max-num-target-sequences NUMBER]
                                   [-o FILE_PATH]
                                   [--per-anticodon-output-file FILE_PATH]
                                   [-O FILENAME_PREFIX]
                                   [--taxonomic-level {t_domain,t_phylum,t_class,t_order,t_family,t_genus,t_species}]
                                   [--matrix-format] [--raw-output]
                                   [-T NUM_THREADS] [-S ANTICODON]
                                   [--report-anticodon-frequencies FILE_PATH]
                                   [--just-do-it]
                                   [--simplify-taxonomy-information]
                                   [--compute-anticodon-coverages]
                                   [--update-profile-db-with-taxonomy]
                                   [-r PATH]

Parameters

INPUT #1: The minimum you must provide to this program is a contigs database, in which case anvi'o will attempt to estimate taxonomy for all contigs in it, assuming that the contigs database represents a single genome. If the contigs database is actually a metagenome, you should use the --metagenome-mode flag to explicitly declare that.

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by
                        'anvi-gen-contigs-database' (default: None)
  -m, --metagenome-mode
                        Treat a given contigs database as a metagenome rather
                        than treating it as a single genome. (default: False)

INPUT #2: In addition, you can also point out a profile database. In which case you also must provide a collection name. When you do that anvi'o will offer taxonomy estimates for each bin in your collection.

  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name. (default: None)

INPUT #3: You can also work with a metagenomes file, assuming that you have multiple metagenomes with or without associated mapping results, and anvi'o will generate a single output file for all.

  -M FILE_PATH, --metagenomes FILE_PATH
                        A TAB-delimited flat text file. The header
                        line must contain these columns: 'name',
                        'contigs_db_path', and 'profile_db_path'. Each line
                        should list a single entry, where 'name' can be any
                        name to describe the metagenome stored in the anvi'o
                        contigs database. In this context, the anvi'o profiles
                        associated with contigs database must be SINGLE
                        PROFILES, as in generated by the program
                        `anvi-profile` and not `anvi-merge`. (default: None)

INPUT #4: Ad hoc sequence search. No contigs databases, no profiles. The lazy stayla. Please note that if you use parameters defined under this option, none of the other standard parameters for this program will be taken into consideration.

  --dna-sequence DNA SEQ
                        Literally a DNA sequence. For the very lazy. (default:
                        None)
  --max-num-target-sequences NUMBER
                        Maximum number of target sequences to request from
                        BLAST or DIAMOND searches. The default is 20.

OUTPUT AND FORMATTING: Anvi'o will do its best to offer you some fancy output tables for your viewing pleasure by default. But in addition to that, you can ask the resulting information to be stored in a TAB-delimited file (which is a much better way to include the results in your study as supplementary information, or work with these results using other analysis tools such as R). Depending on the mode you are running this program, anvi'o may ask you to use an 'output file prefix' rather than an 'output file path'.

  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
  --per-anticodon-output-file FILE_PATH
                        A more detailed output file that will describe
                        taxonomy of each anticodon in a single bin. When
                        consensus taxonomy is generated per bin or genome,
                        taxonomy for each underlying item is not reported.
                        This additional optional output file will elucidate
                        things. (default: None)
  -O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
                        A prefix to be used while naming the output files (no
                        file type extensions please; just a prefix). (default:
                        None)
  --taxonomic-level {t_domain,t_phylum,t_class,t_order,t_family,t_genus,t_species}
                        The taxonomic level to use whenever relevant and/or
                        available. The default taxonomic level is None, but if
                        you choose something specific, anvi'o will focus on
                        that whenever possible.
  --matrix-format       If you want the reports to look like sparse matrices
                        whenever possible, declare this flag. Matrices are
                        especially good to use when you are working with
                        internal/external genomes since they can show you
                        quickly the distribution of each taxon across all
                        metagenomes in programs like EXCEL. WELL TRY IT AND
                        SEE. (default: False)
  --raw-output          Just store the raw output without any processing of
                        the primary data structure. (default: False)

PERFORMANCE: We are not sure if allocating more threads for this operation will change anything. But hey. One can try.

  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: 1)

AUTHORITY: Assert your dominance.

  -S ANTICODON, --anticodon-for-metagenome-mode ANTICODON
                        When running in metagenome mode, anvi'o automatically
                        chooses the most frequent anticodon to estimate the
                        taxonomic composition within a contigs database. If
                        you have a different preference you can use this
                        parameter to communicate that. (default: None)
  --report-anticodon-frequencies FILE_PATH
                        Report anticodon frequencies in a TAB-delimited file
                        and quit. This is a great way to decide which
                        anticodon to use in metagenome mode (we often wish to
                        use the most frequent anticodon to increase the
                        detection of taxa). (default: None)
  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)
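
To make the "most frequent anticodon" idea concrete, here is a minimal Python sketch. It is not anvi'o code: `most_frequent_anticodon` is a hypothetical helper, and the input is assumed to be a plain list of anticodon strings rather than the actual file that `--report-anticodon-frequencies` writes.

```python
from collections import Counter


def most_frequent_anticodon(anticodons):
    """Return (anticodon, count) for the most common anticodon in a list."""
    if not anticodons:
        raise ValueError("No anticodons were given.")
    counts = Counter(anticodons)
    # most_common(1) returns a list with a single (item, count) tuple
    return counts.most_common(1)[0]


# Hypothetical anticodons observed in a contigs database
observed = ["GGC", "TTT", "GGC", "CAT", "GGC", "TTT"]
best, count = most_frequent_anticodon(observed)
```

If the automatic choice is not what you want, the name you prefer is what you would pass to `-S`.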

ADVANCED: Very pro-like stuff.

  --simplify-taxonomy-information
                        The taxonomy output may include a large number of
                        names that contain clade-specific code for not-yet-
                        characterized taxa. With this flag you can simplify
                        taxon names. This will influence all output files and
                        displays as the use of this flag will on-the-fly trim
                        taxonomic levels with clade-specific code names.
                        (default: False)
  --compute-anticodon-coverages
                        When this flag is declared, anvi'o will go back to the
                        profile database to learn coverage statistics of tRNA
                        genes used for taxonomy. (default: False)
  --update-profile-db-with-taxonomy
                        When anvi'o knows both taxonomic affiliations and
                        coverages across samples for single-copy core genes,
                        it can, in theory, add this information to the profile
                        database. With this flag you can instruct anvi'o to do
                        that and find information on taxonomy in the `layers`
                        tab of your interactive interface. (default: False)

BORING: Options that you will likely never need.

  -r PATH, --taxonomy-database PATH
                        Path to the directory that contains the BLAST
                        databases for single-copy core genes. You will almost
                        never need to use this parameter unless you are trying
                        something very fancy. But when you do, you can tell
                        anvi'o where to look for database files through this
                        parameter. (default: None)

anvi-experimental-organization

Create an experimental clustering dendrogram.

Usage

usage: anvi-experimental-organization [-h] [-p PROFILE_DB] -c CONTIGS_DB
                                      [-i DIR_PATH] [-N NAME]
                                      [--distance DISTANCE_METRIC]
                                      [--linkage LINKAGE_METHOD]
                                      [--skip-store-in-db] [-o FILE_PATH]
                                      [--dry-run]
                                      FILE

Parameters

positional arguments:

  FILE                  Config file for clustering of contigs. See
                        documentation for help.

optional arguments:

  -h, --help            show this help message and exit
  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -i DIR_PATH, --input-directory DIR_PATH
                        Input directory where the input files addressed from
                        the configuration file can be found (i.e., the profile
                        database, if PROFILE.db::TABLE notation is used in the
                        configuration file). (default: None)
  -N NAME, --name NAME  The name to use when storing the resulting clustering
                        in the database. This name will appear in the
                        interactive interface and other relevant interfaces.
                        Please consider using a short and descriptive single
                        word (if you do not do that you will make anvi'o
                        complain). (default: None)
  --distance DISTANCE_METRIC
                        The distance metric for the hierarchical clustering.
                        If you do not use this flag, the distance metric you
                        defined in your clustering config file will be used.
                        If you have not defined one in your config file, then
                        the system default will be used, which is "euclidean".
                        (default: None)
  --linkage LINKAGE_METHOD
                        Same story as with `--distance`, except that the system
                        default for this one is "ward". (default: None)
  --skip-store-in-db    By default, analysis results are stored in the profile
                        database. The use of this flag will let you skip that
                        (default: False)
  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
  --dry-run             Don't do anything real. Test everything, and stop
                        right before wherever the developer said 'well, this
                        is enough testing', and decided to print out results.
                        (default: False)
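
The fallback order described for `--distance` and `--linkage` (command line first, then the config file, then the system default) can be sketched in Python as follows. This is only an illustration under stated assumptions: `resolve_setting` and `SYSTEM_DEFAULTS` are hypothetical names, not part of anvi'o.

```python
# System defaults as stated in the help text above
SYSTEM_DEFAULTS = {"distance": "euclidean", "linkage": "ward"}


def resolve_setting(name, cli_value=None, config_value=None):
    """Pick a value: command line first, then config file, then default."""
    if cli_value is not None:
        return cli_value
    if config_value is not None:
        return config_value
    return SYSTEM_DEFAULTS[name]


# Nothing on the command line, nothing in the config file
distance = resolve_setting("distance")
# The config file defines a linkage, and the command line overrides it
linkage = resolve_setting("linkage", cli_value="average", config_value="complete")
```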

anvi-export-collection

Export a collection from an anvi'o database

Usage

usage: anvi-export-collection [-h] -p PAN_OR_PROFILE_DB [-C COLLECTION_NAME]
                              [-O FILENAME_PREFIX] [--list-collections]
                              [--include-unbinned]

Parameters

optional arguments:

  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
                        Anvi'o pan or profile database (and even genes
                        database in appropriate contexts). (default: None)
  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name. (default: None)
  -O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
                        A prefix to be used while naming the output files (no
                        file type extensions please; just a prefix). (default:
                        None)
  --list-collections    Show available collections and exit. (default: False)
  --include-unbinned    When this flag is used, anvi'o will also store in the
                        output file the items that do not appear in any of
                        your bins. This new bin will be called
                        'UNBINNED_ITEMS_BIN'. Yes. The ugly name is
                        intentional. (default: False)
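
What `--include-unbinned` adds to the output can be pictured with a small Python sketch. It assumes a collection is a mapping from bin names to item names; `add_unbinned_bin` is a hypothetical helper and not how anvi'o implements the flag.

```python
def add_unbinned_bin(collection, all_items, unbinned_name="UNBINNED_ITEMS_BIN"):
    """Return a copy of the collection with leftover items in one extra bin."""
    binned = set()
    for items in collection.values():
        binned.update(items)

    result = {name: list(items) for name, items in collection.items()}
    unbinned = [item for item in all_items if item not in binned]
    if unbinned:
        result[unbinned_name] = unbinned
    return result


# Hypothetical collection: split_d does not appear in any bin
demo = add_unbinned_bin(
    {"Bin_1": ["split_a", "split_b"], "Bin_2": ["split_c"]},
    ["split_a", "split_b", "split_c", "split_d"],
)
```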

anvi-export-contigs

Export contigs (or splits) from an anvi'o contigs database

Usage

usage: anvi-export-contigs [-h] -c CONTIGS_DB [--contigs-of-interest FILE]
                           [--splits-mode] -o FILE_PATH [--just-do-it]
                           [--no-wrap]

Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  --contigs-of-interest FILE
                        It is possible to focus on only a set of contigs. If
                        you would like to do that and ignore the rest of the
                        contigs in your contigs database, use this parameter
                        with a flat file every line of which describes a single
                        contig name. (default: None)
  --splits-mode         Export split sequences instead. (default: False)
  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)
  --no-wrap             Do not wrap sequences nicely in the output file.
                        (default: False)

anvi-export-functions

Export functions of genes from an anvi'o contigs database for a given annotation source

Usage

usage: anvi-export-functions [-h] -c CONTIGS_DB [-o FILE_PATH]
                             [--annotation-sources SOURCE NAME[S]] [-l]

Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
  --annotation-sources SOURCE NAME[S]
                        Get functional annotations for a specific list of
                        annotation sources. You can specify one or more
                        sources by separating them from each other with a
                        comma character (i.e., '--annotation-sources
                        source_1,source_2,source_3'). The default behavior is
                        to return everything (default: None)
  -l, --list-annotation-sources
                        List available functional annotation sources.
                        (default: False)

anvi-export-gene-calls

Export gene calls from an anvi'o contigs database

Usage

usage: anvi-export-gene-calls [-h] -c CONTIGS_DB [-o FILE_PATH]
                              [--gene-caller GENE-CALLER]
                              [--list-gene-callers]
                              [--skip-sequence-reporting]

Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
  --gene-caller GENE-CALLER
                        Which gene caller(s) would you like to export gene
                        calls for? If providing multiple they should be comma-
                        separated (no spaces). If you don't know, use --list-
                        gene-callers (default: None)
  --list-gene-callers   List available gene callers in the contigs database
                        and quit. (default: False)
  --skip-sequence-reporting
                        By default, exported gene calls have an amino acid
                        sequences column in the output. Turn this behavior off
                        with this flag (default: False)

anvi-export-gene-coverage-and-detection

Export gene coverage and detection data for all genes associated with contigs described in a profile database

Usage

usage: anvi-export-gene-coverage-and-detection [-h] -p PROFILE_DB -c
                                               CONTIGS_DB -O FILENAME_PREFIX
                                               [--gene-caller-id GENE_CALLER_ID]
                                               [--genes-of-interest FILE]

Parameters

DATABASES: Anvi'o databases to read from

  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)

OUTPUT: Define a prefix for your output files

  -O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
                        A prefix to be used while naming the output files (no
                        file type extensions please; just a prefix). (default:
                        None)

GENES: Gene calls you want to work with. Without these parameters anvi'o will report everything it finds in the profile database. Please note that the reported genes will only include those that occur in contigs that were taken into consideration during anvi-profile, which means that if there was a length cutoff for profiling, genes that occur in contigs shorter than that cutoff will not appear in your output.

  --gene-caller-id GENE_CALLER_ID
                        A single gene id. (default: None)
  --genes-of-interest FILE
                        A file with anvi'o gene caller IDs. There should be
                        only one column in the file, and each line should
                        correspond to a unique gene caller id (without a
                        column header). (default: None)
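
The file format expected by `--genes-of-interest` (one column, one unique gene caller ID per line, no header) is simple enough to produce by hand, but a short Python sketch may help. `write_genes_of_interest` is a hypothetical helper, the IDs are made up, and integer IDs are an assumption of this example.

```python
import os
import tempfile


def write_genes_of_interest(path, gene_caller_ids):
    """Write one unique gene caller ID per line, with no column header."""
    # dict.fromkeys() drops duplicates while keeping the original order
    unique_ids = list(dict.fromkeys(int(g) for g in gene_caller_ids))
    with open(path, "w") as handle:
        for gene_id in unique_ids:
            handle.write(f"{gene_id}\n")
    return unique_ids


# Hypothetical IDs; note that 17 is given twice but written once
demo_path = os.path.join(tempfile.mkdtemp(), "genes-of-interest.txt")
written = write_genes_of_interest(demo_path, [3, 17, 17, 42])
```

The resulting file is the kind of input that `--genes-of-interest` describes.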

anvi-export-items-order

Export an item order from an anvi'o database

Usage

usage: anvi-export-items-order [-h] [-p DB PATH] [--name ORDER NAME]
                               [-o FILE_PATH]

Parameters

INPUT: The database and the items order of interest

  -p DB PATH, --db-path DB PATH
                        An appropriate anvi'o database. (default: None)
  --name ORDER NAME     The name of the order you want to export. If you don't
                        provide an order name, anvi'o will show you the names
                        of all available orders in the database. (default:
                        None)

OUTPUT: Output file name and stuff

  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)

anvi-export-locus

This program helps you cut a 'locus' from a larger genetic context (e.g., contigs, genomes). By default, anvi'o will locate a user-defined anchor gene, extend its selection upstream and downstream based on the --num-genes argument, then extract the locus to create a new contigs database. The anchor gene must be provided as --search-term, --gene-caller-ids, or --hmm-sources. If --flank-mode is designated, you MUST provide TWO flanking genes that define the locus region (Please see --flank-mode help for more information). If everything goes as planned, anvi'o will give you individual locus contigs databases for every matching anchor gene found in the original contigs database provided. Enjoy your mini contigs databases!

Usage

usage: anvi-export-locus [-h] -c CONTIGS_DB [-s SEARCH_TERM]
                         [--gene-caller-ids GENE_CALLER_IDS]
                         [--delimiter CHAR] [-o DIR_PATH] -O FILENAME_PREFIX
                         [--flank-mode] [-n NUM_GENES] [--use-hmm]
                         [--hmm-sources SOURCE NAME] [-l]
                         [--annotation-sources SOURCE NAME[S]] [-W]
                         [--remove-partial-hits] [--never-reverse-complement]

Parameters

Essential INPUT:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)

Query options for locating locus: search according to either hmm or functional annotations

  -s SEARCH_TERM, --search-term SEARCH_TERM
                        search term. (default: None)
  --gene-caller-ids GENE_CALLER_IDS
                        Gene caller ids. Multiple of them can be declared
                        separated by a delimiter (the default is a comma). In
                        anvi-gen-variability-profile, if you declare nothing
                        you will get all genes matching your other filtering
                        criteria. In other programs, you may get everything,
                        nothing, or an error. It really depends on the
                        situation. Fortunately, mistakes are cheap, so it's
                        worth a try. (default: None)
  --delimiter CHAR      The delimiter to parse multiple input terms. The
                        default is ','.

THE OUTPUT: Where should the output go. It will be one FASTA file with all matches or one FASTA per match (see --separate-fasta)

  -o DIR_PATH, --output-dir DIR_PATH
                        Directory path for output files (default: None)
  -O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
                        A prefix to be used while naming the output files (no
                        file type extensions please; just a prefix). (default:
                        None)

ADDITIONAL STUFF: Flags and parameters you can set according to your need

  --flank-mode          If in --flank-mode, anvi-export-locus will extract a
                        locus based on the coordinates of flanking genes. You
                        MUST provide 2 flanking genes in the form of TWO
                        --search-term, --gene-caller-ids, or --hmm-sources.
                        The --flank-mode option is appropriate for extracting
                        loci that vary in their number of genes but are
                        consistently located between the same flanking genes
                        in the genome(s) of interest. (default: False)
  -n NUM_GENES, --num-genes NUM_GENES
                        Required for DEFAULT mode. For each match (to the
                        function, or HMM that was searched) a sequence which
                        includes a block of genes will be saved. The block
                        could include either genes only in the forward
                        direction of the gene (defined according to the
                        direction of transcription of the gene) or reverse or
                        both. If you wish to get both directions, use a comma
                        (no spaces) to define the block. For example, '-n 4,5'
                        will give you four genes before and five genes after.
                        Whereas, '-n 5' will give you five genes after (in
                        addition to the gene that matched). To get only genes
                        preceding the match use '-n 5,0'. If the number of
                        genes requested exceeds the length of the contig, then
                        the output will include the sequence until the end of
                        the contig. (default: None)
  --use-hmm             Use HMM hits instead of functional annotations. In
                        other words, --search-term will be queried against HMM
                        source annotations, NOT functional annotations. If you
                        choose this option, you must also say which HMM source
                        to use. (default: False)
  --hmm-sources SOURCE NAME
                        Get sequences for a specific list of HMM sources. You
                        can list one or more sources by separating them from
                        each other with a comma character (i.e., '--hmm-
                        sources source_1,source_2,source_3'). If you would
                        like to see a list of available sources in the contigs
                        database, run this program with '--list-hmm-sources'
                        flag. (default: None)
  -l, --list-hmm-sources
                        List available HMM sources in the contigs database and
                        quit. (default: False)
  --annotation-sources SOURCE NAME[S]
                        Get functional annotations for a specific list of
                        annotation sources. You can specify one or more
                        sources by separating them from each other with a
                        comma character (i.e., '--annotation-sources
                        source_1,source_2,source_3'). The default behavior is
                        to return everything (default: None)
  -W, --overwrite-output-destinations
                        Overwrite if the output files and/or directories
                        exist. (default: False)
  --remove-partial-hits
                        By default anvi'o will return hits even if they are
                        partial. Declaring this flag will make anvi'o filter
                        all hits that are partial. Partial hits are hits in
                        which you asked for n1 genes before and n2 genes after
                        the gene that matched the search criteria but the
                        search hits the end of the contig before finding the
                        number of genes that you asked. (default: False)
  --never-reverse-complement
                        By default, if a gene that is found by the search
                        criteria is reverse in its direction, then the
                        sequence of the entire locus is reversed before it is
                        saved to the output. If you wish to prevent this
                        behavior then use the flag --never-reverse-complement.
                        (default: False)
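
The `-n` notation and the idea of a 'partial hit' may be easier to see in a few lines of Python. The sketch below is not anvi'o code: `parse_num_genes` and `locus_window` are hypothetical helpers, genes are assumed to be numbered from 0 along a contig, and the anchor gene is assumed to be on the forward strand.

```python
def parse_num_genes(value):
    """Turn a '-n' style value into (genes_before, genes_after)."""
    parts = value.split(",")
    if len(parts) == 1:
        # '-n 5' means five genes after the match
        return 0, int(parts[0])
    if len(parts) == 2:
        # '-n 4,5' means four genes before and five genes after
        return int(parts[0]), int(parts[1])
    raise ValueError(f"Expected 'N' or 'N,M', got '{value}'.")


def locus_window(anchor_index, num_genes_on_contig, before, after):
    """Return (first_gene, last_gene, is_partial) for a locus on one contig."""
    wanted_first = anchor_index - before
    wanted_last = anchor_index + after
    # Clip the block at the contig ends
    first = max(0, wanted_first)
    last = min(num_genes_on_contig - 1, wanted_last)
    # A hit is partial if the contig ended before the requested block did
    is_partial = (first != wanted_first) or (last != wanted_last)
    return first, last, is_partial


before, after = parse_num_genes("4,5")
# Anchor is the third gene (index 2) on a contig with 10 genes
window = locus_window(2, 10, before, after)
```

Here the block is clipped at the start of the contig, so this would count as a partial hit, which is the kind of hit `--remove-partial-hits` filters out.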

anvi-export-misc-data

Export additional data or order tables in pan or profile databases for items or layers

Usage

usage: anvi-export-misc-data [-h] [-p PAN_OR_PROFILE_DB] [-c CONTIGS_DB] -t
                             NAME [-D NAME] [-o FILE_PATH]

Parameters

Database input: Provide 1 of these

  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
                        Anvi'o pan or profile database (and even genes
                        database in appropriate contexts). (default: None)
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)

Details: Everything else.

  -t NAME, --target-data-table NAME
                        The target table is the table you are interested in
                        accessing. Currently it can be 'items', 'layers', or
                        'layer_orders'. Please see the most up-to-date online
                        documentation for more information. (default: None)
  -D NAME, --target-data-group NAME
                        Data group to focus on. Anvi'o misc data tables support
                        associating a set of data keys with a data group. If
                        you have no idea what this is, then probably you don't
                        need it, and anvi'o will take care of you. Note: this
                        flag is IRRELEVANT if you are working with additional
                        order data tables. (default: None)
  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)

anvi-export-splits-and-coverages

Export split or contig sequences and coverages across samples stored in an anvi'o profile database. This program is especially useful if you would like to 'bin' your splits or contigs outside of anvi'o and import the binning results into anvi'o using the anvi-import-collection program.

Usage

usage: anvi-export-splits-and-coverages [-h] -p PROFILE_DB -c CONTIGS_DB
                                        [-o DIR_PATH] [-O FILENAME_PREFIX]
                                        [--splits-mode] [--report-contigs]
                                        [--use-Q2Q3-coverages]

Parameters

optional arguments:

  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -o DIR_PATH, --output-dir DIR_PATH
                        Directory path for output files (default: None)
  -O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
                        A prefix to be used while naming the output files (no
                        file type extensions please; just a prefix). (default:
                        None)
  --splits-mode         Specify this flag if you would like to output
                        coverages of individual 'splits', rather than their
                        'parent' contig coverages. (default: False)
  --report-contigs      By default this program reports sequences and their
                        coverages for 'splits'. By using this flag, you can
                        report contig sequences and coverages instead. For
                        obvious reasons, you can't use this flag with
                        `--splits-mode` flag. (default: False)
  --use-Q2Q3-coverages  By default this program reports the mean coverage of a
                        split (or contig, see --report-contigs) for each
                        sample. By using this flag, you can report the mean
                        Q2Q3 coverage by excluding 25 percent of the
                        nucleotide positions with the smallest coverage
                        values, and 25 percent of the nucleotide positions
                        with the largest coverage values. The hope is that
                        this removes 'outlier' positions resulting from non-
                        specific mapping, etc. that skew the mean coverage
                        estimate. (default: False)
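
The difference between the plain mean and the mean Q2Q3 coverage can be illustrated with a short Python sketch. It follows the description above (drop the 25 percent of positions with the smallest values and the 25 percent with the largest), but `mean_q2q3` is a hypothetical helper, and how anvi'o handles quartile boundaries for lengths that are not divisible by four may differ.

```python
def mean_q2q3(coverages):
    """Mean coverage after dropping the lowest and highest quarter of values."""
    if not coverages:
        raise ValueError("No coverage values were given.")
    ordered = sorted(coverages)
    quarter = len(ordered) // 4
    # Keep only the values between the first and third quartiles
    middle = ordered[quarter:len(ordered) - quarter]
    return sum(middle) / len(middle)


# Hypothetical per-nucleotide coverages with one low and one high outlier
coverages = [1, 10, 10, 10, 10, 10, 10, 100]
plain_mean = sum(coverages) / len(coverages)
q2q3_mean = mean_q2q3(coverages)
```

In this toy example the two outlier positions pull the plain mean up to about 20, while the Q2Q3 mean stays at 10.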

anvi-export-splits-taxonomy

Export taxonomy for splits found in an anvi'o contigs database

Usage

usage: anvi-export-splits-taxonomy [-h] -c CONTIGS_DB -o FILE_PATH

Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)

anvi-export-state

Export an anvi'o state from a pan or profile database

Usage

usage: anvi-export-state [-h] -p PAN_OR_PROFILE_DB [-o FILE_PATH]
                         [-s STATE_NAME] [--list-states]

Parameters

optional arguments:

  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
                        Anvi'o pan or profile database (and even genes
                        database in appropriate contexts). (default: None)
  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
  -s STATE_NAME, --state STATE_NAME
                        The state name to export. (default: None)
  --list-states         Show available states and exit. (default: False)

anvi-export-structures

Export .pdb structure files from a structure database

Usage

usage: anvi-export-structures [-h] -s STRUCTURE_DB [-o DIR_PATH]
                              [--gene-caller-ids GENE_CALLER_IDS]
                              [--genes-of-interest FILE]

Parameters

optional arguments:

  -s STRUCTURE_DB, --structure-db STRUCTURE_DB
                        Anvi'o structure database. (default: None)
  -o DIR_PATH, --output-dir DIR_PATH
                        Directory path for output files (default: None)
  --gene-caller-ids GENE_CALLER_IDS
                        Gene caller ids. Multiple of them can be declared
                        separated by a delimiter (the default is a comma). In
                        anvi-gen-variability-profile, if you declare nothing
                        you will get all genes matching your other filtering
                        criteria. In other programs, you may get everything,
                        nothing, or an error. It really depends on the
                        situation. Fortunately, mistakes are cheap, so it's
                        worth a try. (default: None)
  --genes-of-interest FILE
                        A file with anvi'o gene caller IDs. There should be
                        only one column in the file, and each line should
                        correspond to a unique gene caller id (without a
                        column header). (default: None)

anvi-export-table

Export anvi'o database tables as TAB-delimited text files

Usage

usage: anvi-export-table [-h] [--table TABLE_NAME] [-l] [-f FIELDS]
                         [-o FILE_PATH]
                         DB

Parameters

positional arguments:

  DB                    Anvi'o database to read from.

optional arguments:

  -h, --help            show this help message and exit
  --table TABLE_NAME    Table name to export. (default: None)
  -l, --list            Gives a list of tables in a database and quits. If a
                        table is already declared this time it lists all the
                        fields in a given table, in case you would like to export
                        only a specific list of fields from the table using
                        --fields parameter. (default: False)
  -f FIELD(S), --fields FIELD(S)
                        Fields to report. Use --list parameter with a
                        table name to see available fields. You can list fields
                        using this notation: --fields 'field_1, field_2, ...
                        field_N'. (default: None)
  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
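
Anvi'o databases are SQLite files, so the general idea of exporting a table with a subset of fields as TAB-delimited text can be sketched with Python's standard library. This is not how anvi-export-table is implemented: `export_table`, the `demo` table, and its field names are all made up for illustration.

```python
import csv
import io
import sqlite3


def export_table(conn, table, fields=None):
    """Return a table as TAB-delimited text, optionally with selected fields."""
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]

    # Accept the "field_1, field_2" notation; default to every column
    wanted = [f.strip() for f in fields.split(",")] if fields else columns
    missing = [f for f in wanted if f not in columns]
    if missing:
        raise ValueError(f"Unknown field(s): {', '.join(missing)}")

    indices = [columns.index(f) for f in wanted]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(wanted)
    for row in cursor:
        writer.writerow([row[i] for i in indices])
    return out.getvalue()


# A throwaway in-memory database standing in for a real anvi'o database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (gene_id INTEGER, source TEXT, score REAL)")
conn.executemany("INSERT INTO demo VALUES (?, ?, ?)", [(1, "A", 0.5), (2, "B", 1.5)])
text = export_table(conn, "demo", "gene_id, score")
```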

anvi-gen-contigs-database

Generate a new anvi'o contigs database

Usage

usage: anvi-gen-contigs-database [-h] -f FASTA [-n PROJECT_NAME]
                                 [-T NUM_THREADS] [-o DB_FILE_PATH]
                                 [--db-variant VARIANT]
                                 [--description TEXT_FILE] [-L INT]
                                 [--skip-mindful-splitting] [-K INT]
                                 [--skip-gene-calling]
                                 [--prodigal-translation-table INT]
                                 [--external-gene-calls GENE-CALLS]
                                 [--ignore-internal-stop-codons]
                                 [--skip-predict-frame]

Parameters

MANDATORY INPUTS: Things you really need to provide to be in business.

  -f FASTA, --contigs-fasta FASTA
                        The FASTA file that contains reference sequences you
                        mapped your samples against. This could be a reference
                        genome, or contigs from your assembler. Contig names
                        in this file must match those in other input files.
                        If there is a problem anvi'o will gracefully complain
                        about it. (default: None)
  -n PROJECT_NAME, --project-name PROJECT_NAME
                        Name of the project. Please choose a short but
                        descriptive name (so anvi'o can use it whenever she
                        needs to name an output file, or add a new table in a
                        database, or name her first born). (default: None)

PERFORMANCE: You have multiple cores? WELL, USE THEM MAYBE?

  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: 1)
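
The advice about not exceeding the number of CPUs can be applied before choosing a value for `-T`. The Python sketch below is one way to do that; `capped_thread_count` is a hypothetical helper, and on a shared cluster the scheduler's allocation, not `os.cpu_count()`, is the number that matters.

```python
import os


def capped_thread_count(requested, available=None):
    """Clamp a requested thread count to the range 1..available CPUs."""
    if available is None:
        # os.cpu_count() can return None, so fall back to 1
        available = os.cpu_count() or 1
    return max(1, min(requested, available))


# Ask for 16 threads on a machine assumed to have 8 cores
threads = capped_thread_count(16, available=8)
```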

OPTIONAL INPUTS: Things you may want to tweak.

  -o DB_FILE_PATH, --output-db-path DB_FILE_PATH
                        Output file path for the new database. (default:
                        CONTIGS.db)
  --db-variant VARIANT  A free-form text variable to associate a database with
                        a variant for power users and/or programmers. Please
                        leave this blank unless you are certain that you need
                        to set a db variant since it may influence downstream
                        processes. In an ideal world a variant would be a
                        single-word, without any capitalized letters or
                        special characters. (default: unknown)
  --description TEXT_FILE
                        A plain text file that contains some description about
                        the project. You can use Markdown syntax. The
                        description text will be rendered and shown in all
                        relevant interfaces, including the anvi'o interactive
                        interface, or anvi'o summary outputs. (default: None)
  -L INT, --split-length INT
                        Anvi'o splits very long contigs into smaller pieces,
                        without actually splitting them for real. These
                        'virtual' splits improve the efficacy of the
                        visualization step, and changing the split size gives
                        freedom to the user to adjust the resolution of their
                        display when necessary. The default value is (20000).
                        If you are planning to use your contigs database for
                        metagenomic binning, we advise you to not go below
                        10,000 (since the lower the split size is, the more
                        items to show in the display, and decreasing the split
                        size does not really help binning much). But if you
                        are thinking about using this parameter for ad hoc
                        investigations other than binning, you should ignore
                        our advice, and set the split size as low as you want.
                        If you do not want your contigs to be split, you can
                        set the split size to '0' or any negative
                        integer (lots of unnecessary freedom here, enjoy!).
  --skip-mindful-splitting
                        By default, when soft-splitting large contigs, anvi'o
                        attempts to avoid cutting through proper gene calls,
                        so that a single gene is not broken into multiple
                        splits. This requires a careful examination of where
                        genes start and end, and finding the best locations to
                        split contigs with respect to this information. So,
                        when the user asks for a split size of, say, 1,000, it
                        serves as a mere suggestion. When this flag is used,
                        anvi'o does what the user wants and creates splits at
                        the desired lengths (although some functionality may
                        become unavailable for the projects that rely on a
                        contigs database that is initiated this way).
                        (default: False)
  -K INT, --kmer-size INT
                        K-mer size for k-mer frequency calculations. The
                        default k-mer size for composition-based analyses is
                        4, historically. Although tetra-nucleotide frequencies
                        seem to offer the sweet spot of sensitivity,
                        information density, and manageable number of
                        dimensions for clustering approaches, you are welcome
                        to experiment (but maybe you should leave it as is for
                        your first set of analyses). (default: 4)
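
To make the `--kmer-size` option more concrete, here is a minimal Python sketch of counting overlapping k-mers in a sequence. It is an illustration only, not anvi'o's implementation: the function name `kmer_frequencies` and the example sequence are hypothetical, and the handling of ambiguous bases (k-mers containing anything other than A, C, G, T are skipped) is an assumption.

```python
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Count every overlapping k-mer made only of A, C, G, T in `sequence`."""
    sequence = sequence.upper()
    # Start from all 4**k possible k-mers so that every sequence yields
    # a vector with the same number of dimensions.
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:  # assumption: skip k-mers with N or other symbols
            counts[kmer] += 1
    return counts


# Hypothetical sequence for illustration.
freqs = kmer_frequencies("ACGTACGTNACGT", k=4)
print(len(freqs))     # number of dimensions, 4**4 = 256
print(freqs["ACGT"])  # 3
```

The number of dimensions grows as 4**k, which is one way to see why small values of k keep clustering manageable.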

GENES IN CONTIGS: Expert thingies.

  --skip-gene-calling   By default, generating an anvi'o contigs database
                        includes the identification of open reading frames in
                        contigs by running a bacterial gene caller. Declaring
                        this flag will by-pass that process. If you prefer,
                        you can later import your own gene calling results
                        into the database. (default: False)
  --prodigal-translation-table INT
                        This is a parameter to pass to Prodigal for a specific
                        translation table. This parameter corresponds to the
                        parameter `-g` in Prodigal, the default value of which
                        is 11 (so if you do not set anything, it will be set
                        to 11 in Prodigal runtime). Please refer to the
                        Prodigal documentation to determine what is the right
                        translation table for you if you think you need it.
                        (default: None)
  --external-gene-calls GENE-CALLS
                        A TAB-delimited file to define external gene calls.
                        The file must have these columns: 'gene_callers_id' (a
                        unique integer number for each gene call, start from
                        1), 'contig' (the contig name the gene call is found),
                        'start' (start position, integer), 'stop' (stop
                        position, integer), 'direction' (the direction of the
                        gene open reading frame; can be 'f' or 'r'), 'partial'
                        (whether it is a complete gene call, or a partial one;
                        must be 1 for partial calls, and 0 for complete
                        calls), 'call_type' (1 if it is coding, 2 if it is
                        noncoding, or 3 if it is unknown (only gene calls with
                        call_type = 1 will have amino acid sequences
                        translated)), 'source' (the gene caller), and
                        'version' (the version of the gene caller, e.g.,
                        v2.6.7 or v1.0). An additional 'optional' column is
                        'aa_sequence' to explicitly define the amino acid
                        sequence of a gene call so anvi'o does not attempt to
                        translate the DNA sequence itself. An EXAMPLE FILE
                        (with the optional 'aa_sequence' column (so feel free
                        to take it out for your own case)) can be found at the
                        URL https://bit.ly/2qEEHuQ. If you are providing
                        external gene calls, please also see the flag
                        `--skip-predict-frame`. (default: None)
  --ignore-internal-stop-codons
                        This is only relevant when you have an external gene
                        calls file. If anvi'o figures out that your custom
                        gene calls result in amino acid sequences with stop
                        codons in the middle, it will complain about it. You
                        can use this flag to tell anvi'o not to check for
                        internal stop codons. Even though this shouldn't
                        happen in theory, we understand that it almost always
                        does. In these cases, anvi'o understands that
                        sometimes we don't want to care, and will not judge
                        you. Instead, it will replace every stop codon residue
                        in the amino acid sequence with an 'X' character.
                        Please let us know if you used this and things failed,
                        so we can tell you that you shouldn't have really used
                        it if you didn't like failures in the first place
                        (smiley). (default: False)
  --skip-predict-frame  When you provide an external gene calls file,
                        anvi'o will predict the correct frame for gene calls
                        as best as it can by using a previously-generated
                        Markov model that is trained using the uniprot50
                        database (see this for details:
                        https://github.com/merenlab/anvio/pull/1428), UNLESS
                        there is an `aa_sequence` entry for a given gene call
                        in the external gene calls file. Please note that
                        PREDICTING FRAMES MAY CHANGE START/STOP POSITIONS OF
                        YOUR GENE CALLS SLIGHTLY, if those that are in the
                        external gene calls file are not describing proper
                        gene calls according to the model. If you use this
                        flag, anvi'o will not rely on any model and will
                        attempt to translate your DNA sequences by solely
                        relying upon start/stop positions in the file, but it
                        will complain about sequences whose start/stop
                        positions are not divisible by 3. (default: False)
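
The column layout described for `--external-gene-calls` above can be sketched in Python as follows. This is a minimal, hedged example: the helper `write_external_gene_calls`, the contig names, the coordinates, and the gene caller name and version are all hypothetical, and the checks only mirror the constraints stated in the help text (they are not anvi'o's own validation).

```python
import csv
import os
import tempfile

# Column names as listed in the help text for --external-gene-calls.
REQUIRED_COLUMNS = [
    "gene_callers_id", "contig", "start", "stop",
    "direction", "partial", "call_type", "source", "version",
]


def write_external_gene_calls(path, gene_calls):
    """Write gene calls (a list of dicts) as a TAB-delimited file."""
    for call in gene_calls:
        missing = [c for c in REQUIRED_COLUMNS if c not in call]
        if missing:
            raise ValueError(f"Missing columns: {missing}")
        if call["direction"] not in ("f", "r"):
            raise ValueError("'direction' must be 'f' or 'r'")
        if call["partial"] not in (0, 1):
            raise ValueError("'partial' must be 0 or 1")
        if call["call_type"] not in (1, 2, 3):
            raise ValueError("'call_type' must be 1, 2, or 3")
    # Add the optional 'aa_sequence' column only if some call provides it.
    has_aa = any("aa_sequence" in call for call in gene_calls)
    fieldnames = REQUIRED_COLUMNS + (["aa_sequence"] if has_aa else [])
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames,
                                delimiter="\t", lineterminator="\n")
        writer.writeheader()
        writer.writerows(gene_calls)


# Hypothetical gene calls; every value here is made up for illustration.
example_calls = [
    {"gene_callers_id": 1, "contig": "contig_001", "start": 12, "stop": 312,
     "direction": "f", "partial": 0, "call_type": 1,
     "source": "my_gene_caller", "version": "v1.0"},
    {"gene_callers_id": 2, "contig": "contig_001", "start": 450, "stop": 900,
     "direction": "r", "partial": 1, "call_type": 1,
     "source": "my_gene_caller", "version": "v1.0"},
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "external-gene-calls.txt")
    write_external_gene_calls(out_path, example_calls)
    with open(out_path) as handle:
        lines = handle.read().splitlines()

print(lines[0])  # the TAB-delimited header line
```

A file produced this way could then be passed to `--external-gene-calls`; whether it passes anvi'o's own sanity checks depends on your actual contigs and coordinates.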

anvi-gen-fixation-index-matrix

Generate a pairwise matrix of fixation indices between samples

Example uses and other resources

Usage

usage: anvi-gen-fixation-index-matrix [-h] [-p PROFILE_DB] [-c CONTIGS_DB]
                                      [-s STRUCTURE_DB] [-V VARIABILITY_TABLE]
                                      [-C COLLECTION_NAME] [-b BIN_NAME]
                                      [--splits-of-interest FILE]
                                      [--genes-of-interest FILE]
                                      [--gene-caller-ids GENE_CALLER_IDS]
                                      [--only-if-structure]
                                      [--samples-of-interest FILE]
                                      [--engine ENGINE]
                                      [--min-coverage-in-each-sample INT]
                                      [-o FIXATION_INDICES] [--keep-negatives]

Parameters

DATABASES: Declaring relevant anvi'o databases. First things first. Some are mandatory, some are optional.

  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by
                        'anvi-gen-contigs-database' (default: None)
  -s STRUCTURE_DB, --structure-db STRUCTURE_DB
                        Anvi'o structure database. (default: None)
  -V VARIABILITY_TABLE, --variability-profile VARIABILITY_TABLE
                        The output of anvi-gen-variability-profile, or a
                        different variant-calling output that has been
                        converted to the anvi'o format. (default: None)

FOCUS :: BIN: You need to pick something to focus on. You can ask anvi'o to work with a bin in a collection.

  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name. (default: None)
  -b BIN_NAME, --bin-id BIN_NAME
                        Bin name you are interested in. (default: None)

FOCUS :: SPLIT NAMES: Alternatively you can declare split names to focus on.

  --splits-of-interest FILE
                        A file with split names. There should be only one
                        column in the file, and each line should correspond to
                        a unique split name. (default: None)

FOCUS :: GENE CALLER IDs: Alternatively you can declare gene caller IDs to focus on.

  --genes-of-interest FILE
                        A file with anvi'o gene caller IDs. There should be
                        only one column in the file, and each line should
                        correspond to a unique gene caller id (without a
                        column header). (default: None)
  --gene-caller-ids GENE_CALLER_IDS
                        Gene caller ids. Multiple of them can be declared
                        separated by a delimiter (the default is a comma). In
                        anvi-gen-variability-profile, if you declare nothing
                        you will get all genes matching your other filtering
                        criteria. In other programs, you may get everything,
                        nothing, or an error. It really depends on the
                        situation. Fortunately, mistakes are cheap, so it's
                        worth a try. (default: None)
  --only-if-structure   If provided, your genes of interest will be further
                        subset to only include genes with structures in your
                        structure database, and therefore must be supplied in
                        conjunction with a structure database, i.e. `-s
                        <your_structure_database>`. If you did not specify
                        genes of interest, ALL genes will be subset to those
                        that have structures. (default: False)

SAMPLES: You can ask anvi'o to focus only on a subset of samples.

  --samples-of-interest FILE
                        A file with sample names. There should be only one
                        column in the file, and each line should correspond to
                        a unique sample name (without a column header).
                        (default: None)

ENGINE: Set your engine. This is important as it will define the output profile you will get from this program. The engine can focus on nucleotides (NT), codons (CDN), or amino acids (AA).

  --engine ENGINE       Variability engine. The default is 'NT'.

FILTERS: Parameters that will help you to do a very precise analysis. If you declare nothing from this bunch, you will get "everything" to play with, which is not necessarily a good thing…

  --min-coverage-in-each-sample INT
                        Minimum coverage of a given variable nucleotide
                        position in all samples. If a nucleotide position is
                        covered less than this value even in one sample, it
                        will be removed from the analysis. Default is 0.

OUTPUT: Output file and style

  -o FIXATION_INDICES, --output-file FIXATION_INDICES
                        File path to store results. (default:
                        fixation_indices.txt)

EXTRAS: Because why not be extra?

  --keep-negatives      Negative numbers are theoretically possible, and are
                        sometimes interpreted as out-breeding. By default, we
                        set negative numbers to 0 so the results are
                        reflective of a standard distance metric. Provide this
                        flag if you would prefer otherwise. (default: False)
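
As a rough illustration of what `--keep-negatives` toggles, the sketch below sets negative entries of a pairwise matrix to 0 unless asked otherwise. The function name `clip_negative_indices` and the matrix values are hypothetical; this only mirrors the behavior described in the help text and does not compute fixation indices itself.

```python
def clip_negative_indices(matrix, keep_negatives=False):
    """Return a copy of a pairwise matrix with negative values set to 0."""
    if keep_negatives:
        return [list(row) for row in matrix]
    return [[max(0.0, value) for value in row] for row in matrix]


# Hypothetical pairwise values between three samples.
raw = [
    [0.0, 0.12, -0.03],
    [0.12, 0.0, 0.40],
    [-0.03, 0.40, 0.0],
]

clipped = clip_negative_indices(raw)
print(clipped[0])  # [0.0, 0.12, 0.0]
```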

anvi-gen-gene-consensus-sequences

Collapse variability for a set of genes across samples

Usage

usage: anvi-gen-gene-consensus-sequences [-h] -p PROFILE_DB -c CONTIGS_DB
                                         [--gene-caller-ids GENE_CALLER_IDS]
                                         [--genes-of-interest FILE]
                                         [--samples-of-interest FILE]
                                         [-o FILE_PATH] [--tab-delimited]
                                         [--engine ENGINE] [--contigs-mode]
                                         [--quince-mode] [--compress-samples]

Parameters

optional arguments:

  --compress-samples    Normally all samples with variation will have their
                        own consensus sequence. If this flag is provided, the
                        coverages from each sample of interest will be summed
                        and only a single consensus sequence for each
                        gene/contig will be output. (default: False)
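
The difference between per-sample consensus sequences and the `--compress-samples` behavior can be sketched as follows. This is a simplified illustration under stated assumptions, not anvi'o's implementation: the data structure (one `Counter` of nucleotide counts per position per sample), the function names, and the counts are all hypothetical, and the consensus is simply taken to be the most frequent nucleotide at each position.

```python
from collections import Counter


def consensus(position_counts):
    """Pick the most frequent nucleotide at each position."""
    return "".join(counts.most_common(1)[0][0] for counts in position_counts)


def compress(samples):
    """Sum per-position nucleotide counts across all samples."""
    n_positions = len(next(iter(samples.values())))
    return [
        sum((counts[i] for counts in samples.values()), Counter())
        for i in range(n_positions)
    ]


# Hypothetical per-position nucleotide counts for two samples.
samples = {
    "sample_a": [Counter(A=8, G=2), Counter(C=10), Counter(C=7, T=3)],
    "sample_b": [Counter(A=1, G=9), Counter(C=10), Counter(C=1, T=9)],
}

per_sample = {name: consensus(counts) for name, counts in samples.items()}
compressed = consensus(compress(samples))

print(per_sample)  # {'sample_a': 'ACC', 'sample_b': 'GCT'}
print(compressed)  # GCT
```

In this made-up example the summed counts are dominated by `sample_b`, so the single compressed consensus matches it rather than `sample_a`.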

DATABASES: Declaring relevant anvi'o databases. First things first.

  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by
                        'anvi-gen-contigs-database' (default: None)

FOCUS: What do we want? A consensus sequence for a gene, or a list of genes. From where do we want it? All samples, by default. When do we want it? Whenever it is convenient.

  --gene-caller-ids GENE_CALLER_IDS
                        Gene caller ids. Multiple of them can be declared
                        separated by a delimiter (the default is a comma). In
                        anvi-gen-variability-profile, if you declare nothing
                        you will get all genes matching your other filtering
                        criteria. In other programs, you may get everything,
                        nothing, or an error. It really depends on the
                        situation. Fortunately, mistakes are cheap, so it's
                        worth a try. (default: None)
  --genes-of-interest FILE
                        A file with anvi'o gene caller IDs. There should be
                        only one column in the file, and each line should
                        correspond to a unique gene caller id (without a
                        column header). (default: None)
  --samples-of-interest FILE
                        A file with sample names. There should be only one
                        column in the file, and each line should correspond to
                        a unique sample name (without a column header).
                        (default: None)

OUTPUT: Output file and output style

  -o FILE_PATH, --output-file FILE_PATH
                        The output file name. The boring default is
                        "genes.fa". You can change the output file format to a
                        TAB-delimited file using the flag `--tab-delimited`,
                        in which case please do not forget to change the file
                        name, too.
  --tab-delimited       Use the TAB-delimited format for the output file.
                        (default: False)

EXTRAS: Parameters that will help you to do a very precise analysis. If you declare nothing from this bunch, you will get "everything" to play with, which is not necessarily a good thing…

  --engine ENGINE       Variability engine. The default is 'NT'.
  --contigs-mode        Use this flag to output consensus sequences of
                        contigs, instead of the default, which is genes
                        (default: False)
  --quince-mode         Use this flag to output consensus sequences for cases
                        even where there is no variability (default: False)

anvi-gen-gene-level-stats-databases

A program to compute genes databases for a given set of bins stored in an anvi'o collection. Genes databases store gene-level coverage and detection statistics, and they are usually computed and generated automatically when they are required (such as running anvi-interactive with --gene-mode flag). This program allows you to pre-compute them if you don't want them to be done all at once.

Usage

usage: anvi-gen-gene-level-stats-databases [-h] -c CONTIGS_DB -p PROFILE_DB
                                           [-C COLLECTION_NAME] [-b BIN_NAME]
                                           [-B FILE_PATH]
                                           [--zeros-are-outliers]
                                           [--outliers-threshold NUM]
                                           [--just-do-it] [--inseq-stats]

Parameters

INPUT DATABASES: Which anvi'o databases do you wish to work with today?

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by
                        'anvi-gen-contigs-database' (default: None)
  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)

BIN(S) AND COLLECTION: You can select a bin, multiple bins, or you can simply focus on every bin in a collection by providing only a collection name. Once you are done with your selection, anvi'o will generate an individual genes database for each of the bins it finds.

  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name. (default: None)
  -b BIN_NAME, --bin-id BIN_NAME
                        Bin name you are interested in. (default: None)
  -B FILE_PATH, --bin-ids-file FILE_PATH
                        Text file for bins (each line should be a unique bin
                        id). (default: None)

ADDITIONAL PARAMETERS: These parameters are those that are critical to identify outlier nucleotide positions and how to define what should be included in those calculations. In most cases you can leave them as is, and things are going to be alright.

  --zeros-are-outliers  If you want all zero coverage positions to be treated
                        like outliers then use this flag. The reason to treat
                        zero coverage as outliers is that when mapping
                        reads to a reference we could get many zero positions
                        due to accessory genes. These positions then skew the
                        average values that we compute. (default: False)
  --outliers-threshold NUM
                        Threshold to use for the outlier detection. The
                        default value is '1.5'. Absolute deviation around the
                        median is used. To read more about the method please
                        refer to: 'How to Detect and Handle Outliers' by Boris
                        Iglewicz and David Hoaglin
                        (doi:10.1016/j.jesp.2013.03.013).
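
For readers unfamiliar with outlier detection based on the absolute deviation around the median, here is a minimal Python sketch of one common formulation (the modified z-score, with the conventional 0.6745 scaling constant). It is an assumption that this matches what anvi'o computes internally; the function name `find_outliers`, its handling of zero spread, and the coverage values are hypothetical.

```python
from statistics import median


def find_outliers(values, threshold=1.5, zeros_are_outliers=False):
    """Flag values whose modified z-score exceeds `threshold`."""
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)  # median absolute deviation
    flags = []
    for value, deviation in zip(values, abs_dev):
        if zeros_are_outliers and value == 0:
            flags.append(True)
        elif mad == 0:
            # Assumption: with no spread, nothing else is called an outlier.
            flags.append(False)
        else:
            flags.append(0.6745 * deviation / mad > threshold)
    return flags


# Hypothetical per-position coverage values for one gene.
coverages = [10, 12, 11, 9, 10, 50, 0]
print(find_outliers(coverages))
# [False, False, False, False, False, True, True]
```

Lowering the threshold flags more positions as outliers, and raising it flags fewer.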

PARAMETERS OF CONVENIENCE: They say they save lives.

  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)

INSEQ DATA: When analyzing INSeq/Tn-Seq data

  --inseq-stats         Provide if working with INSeq/Tn-Seq genomic data.
                        With this, all gene level coverage stats will be
                        calculated using INSeq/Tn-Seq statistical methods.
                        (default: False)

anvi-gen-genomes-storage

Create a genome storage from internal and/or external genomes for a pangenome analysis

Example uses and other resources

Usage

usage: anvi-gen-genomes-storage [-h] [-e FILE_PATH] [-i FILE_PATH]
                                [--gene-caller GENE-CALLER] -o GENOMES_STORAGE

Parameters

EXTERNAL GENOMES: External genomes listed as anvi'o contigs databases. As in, you have one or more genomes say from NCBI you want to work with, and you created an anvi'o contigs database for each one of them.

  -e FILE_PATH, --external-genomes FILE_PATH
                        A two-column TAB-delimited flat text file that lists
                        anvi'o contigs databases. The first item in the header
                        line should read 'name', and the second should read
                        'contigs_db_path'. Each line in the file should
                        describe a single entry, where the first column is the
                        name of the genome (or MAG), and the second column is
                        the anvi'o contigs database generated for this genome.
                        (default: None)

INTERNAL GENOMES: Genome bins stored in anvi'o profile databases as collections.

  -i FILE_PATH, --internal-genomes FILE_PATH
                        A five-column TAB-delimited flat text file. The header
                        line must contain these columns: 'name', 'bin_id',
                        'collection_id', 'profile_db_path', 'contigs_db_path'.
                        Each line should list a single entry, where 'name' can
                        be any name to describe the anvi'o bin identified as
                        'bin_id' that is stored in a collection. (default:
                        None)
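
The two input files described above (external and internal genomes) are plain TAB-delimited tables, and a minimal Python sketch for producing them might look like this. The column names come from the help text; the helper `write_tsv`, the file names, and every genome, bin, collection, and database path in the rows are hypothetical placeholders.

```python
import csv
import os
import tempfile

EXTERNAL_HEADER = ["name", "contigs_db_path"]
INTERNAL_HEADER = ["name", "bin_id", "collection_id",
                   "profile_db_path", "contigs_db_path"]


def write_tsv(path, header, rows):
    """Write a header line followed by rows as a TAB-delimited file."""
    for row in rows:
        if len(row) != len(header):
            raise ValueError(f"Expected {len(header)} columns, got {len(row)}")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)


# Hypothetical entries; replace with your own names and paths.
external_rows = [
    ["Genome_A", "Genome_A.db"],
    ["Genome_B", "Genome_B.db"],
]
internal_rows = [
    ["MAG_1", "Bin_1", "default", "PROFILE.db", "CONTIGS.db"],
]

with tempfile.TemporaryDirectory() as tmp:
    ext_path = os.path.join(tmp, "external-genomes.txt")
    int_path = os.path.join(tmp, "internal-genomes.txt")
    write_tsv(ext_path, EXTERNAL_HEADER, external_rows)
    write_tsv(int_path, INTERNAL_HEADER, internal_rows)
    with open(ext_path) as handle:
        external_lines = handle.read().splitlines()
    with open(int_path) as handle:
        internal_lines = handle.read().splitlines()

print(external_lines[0])
print(internal_lines[0])
```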

PRO STUFF: Things you may not have to change. But you never know (unless you read the help).

  --gene-caller GENE-CALLER
                        The gene caller to utilize. Anvi'o supports multiple
                        gene callers, and some operations (including this one)
                        require an explicit mention of which one to use. The
                        default is prodigal, but it will not be enough if you
                        were experiencing your rebelhood as you should, and
                        have generated your contigs database with
                        `--external-gene-calls` or something. Also, some HMM
                        collections may add new gene calls into a given
                        contigs database in an ad hoc fashion, so if you want
                        to see all the options available to you in a given
                        contigs database, please run the program
                        `anvi-db-info` and take a look at the output.
                        (default: prodigal)

OUTPUT: Give it a nice name. Must end with '-GENOMES.db'. This is primarily due to the fact that there are other .db files used throughout anvi'o and it would be better to distinguish this very special file from them.

  -o GENOMES_STORAGE, --output-file GENOMES_STORAGE
                        File path to store results. (default: None)

anvi-gen-network

Generate a Gephi network for functions based on non-normalized gene coverage values

Usage

usage: anvi-gen-network [-h] -p PROFILE_DB -c CONTIGS_DB
                        [--annotation-source SOURCE NAME] [-l]

Parameters

optional arguments:

  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by
                        'anvi-gen-contigs-database' (default: None)
  --annotation-source SOURCE NAME
                        Get functional annotations for a specific annotation
                        source. You can use the flag
                        '--list-annotation-sources' to learn about what
                        sources are available. (default: None)
  -l, --list-annotation-sources
                        List available functional annotation sources.
                        (default: False)

anvi-gen-phylogenomic-tree

Generate a phylogenomic tree from an alignment file

Example uses and other resources

Usage

usage: anvi-gen-phylogenomic-tree [-h] -f FASTA file -o FILE_PATH
                                  [--program PROGRAM_NAME]

Parameters

INPUT FILES: Concatenated alignment files exported using anvi-get-sequences-for-gene-clusters

  -f FASTA file, --fasta-file FASTA file
                        A FASTA-formatted input file. (default: None)

OUTPUT FILE: The output file where the generated newick tree will be stored.

  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)

PROGRAM: The program that will be used for generating the tree. Available options: default, fasttree

  --program PROGRAM_NAME
                        Program name. (default: default)

anvi-gen-structure-database

Creates a database of protein structures. Predict protein structures using template-based homology modelling of genes in your contigs database, or import pre-computed PDB structures you already have.

Example uses and other resources

Usage

usage: anvi-gen-structure-database [-h] -c CONTIGS_DB [--pdb-db PDB_DB]
                                   [--genes-of-interest FILE]
                                   [--gene-caller-ids GENE_CALLER_IDS]
                                   [-o DB_FILE_PATH] [--dump-dir DUMP_DIR]
                                   [--external-structures FILE_PATH]
                                   [--num-models NUM_MODELS]
                                   [--deviation DEVIATION]
                                   [--modeller-database MODELLER_DATABASE]
                                   [--scoring-method SCORING_METHOD]
                                   [--very-fast]
                                   [--percent-cutoff PERCENT_CUTOFF]
                                   [--alignment-fraction-cutoff ALIGNMENT_FRACTION_CUTOFF]
                                   [--max-number-templates MAX_NUMBER_TEMPLATES]
                                   [--skip-DSSP]
                                   [--modeller-executable MODELLER_EXECUTABLE]
                                   [--offline-mode] [-T NUM_THREADS]
                                   [--write-buffer-size-per-thread INT]

Parameters

DATABASES: Declaring relevant anvi'o databases. First things first.

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by
                        'anvi-gen-contigs-database' (default: None)
  --pdb-db PDB_DB       By default, this program accesses the structure files
                        it needs from an internal anvi'o database that can be
                        set up with anvi-setup-pdb-database. If a required
                        structure is not in this database, it will instead be
                        downloaded from the RCSB PDB server. This parameter
                        exists only if a) you created a database and b) it
                        exists in a custom location. In this case, please
                        provide that path here. Otherwise we vibing. (default:
                        None)

GENES: Specifying which genes you want to be modelled.

  --genes-of-interest FILE
                        A file with anvi'o gene caller IDs. There should be
                        only one column in the file, and each line should
                        correspond to a unique gene caller id (without a
                        column header). (default: None)
  --gene-caller-ids GENE_CALLER_IDS
                        Gene caller ids. Multiple of them can be declared
                        separated by a delimiter (the default is a comma). In
                        anvi-gen-variability-profile, if you declare nothing
                        you will get all genes matching your other filtering
                        criteria. In other programs, you may get everything,
                        nothing, or an error. It really depends on the
                        situation. Fortunately, mistakes are cheap, so it's
                        worth a try. (default: None)

OUTPUT: Output file and output style.

  -o DB_FILE_PATH, --output-db-path DB_FILE_PATH
                        Output file path for the new database. (default: None)
  --dump-dir DUMP_DIR   Modeling and annotating structures requires a lot of
                        moving parts, each of which has its own outputs. The
                        output of this program is a structure database
                        containing the pertinent results of this computation,
                        however a lot of stuff doesn't make the cut. By
                        providing a directory for this parameter you will get,
                        in addition to the structure database, a directory
                        containing the raw output for everything. (default:
                        None)

NON-MODELLER OPTIONS: Alternatives to using MODELLER

  --external-structures FILE_PATH
                        A two-column TAB-delimited flat text file that lists
                        PDB protein structures. The first item in the header
                        line should read 'gene_callers_id', and the second
                        should read 'path'. Each line in the file should
                        describe a single entry, where the first column is the
                        gene_callers_id that the structure corresponds to, and
                        the second column is the path to the structure file.
                        (default: None)
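
Before handing a file to `--external-structures`, it can help to check that it follows the two-column layout described above. The sketch below reads such a file and validates its header. It is only an illustration: the function name `read_external_structures`, the example paths, and the extra checks (integer gene caller ids, no duplicates) are assumptions rather than anvi'o's own rules.

```python
import csv
import io

EXPECTED_HEADER = ["gene_callers_id", "path"]


def read_external_structures(handle):
    """Parse a TAB-delimited external structures file into {gene_id: path}."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != EXPECTED_HEADER:
        raise ValueError(f"Header must be {EXPECTED_HEADER}")
    structures = {}
    for row in reader:
        gene_id = int(row["gene_callers_id"])  # assumption: ids are integers
        if gene_id in structures:
            # Assumption: one structure file per gene caller id.
            raise ValueError(f"Duplicate gene_callers_id: {gene_id}")
        structures[gene_id] = row["path"]
    return structures


# Hypothetical file content; the paths are placeholders.
example = io.StringIO(
    "gene_callers_id\tpath\n"
    "1\tstructures/gene_1.pdb\n"
    "7\tstructures/gene_7.pdb\n"
)
structures = read_external_structures(example)
print(structures)
# {1: 'structures/gene_1.pdb', 7: 'structures/gene_7.pdb'}
```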

MODELLER PARAMS: Parameters for MODELLER's homology modeling.

  --num-models NUM_MODELS, -N NUM_MODELS
                        This parameter determines the number of predicted
                        structures that are solved for a given protein. The
                        original atomic positions for each model are perturbed
                        by an amount defined by --deviation, which leads to
                        differences between each model. Therefore, whichever
                        of the N models is chosen to be the "best" model is
                        more likely to be accurate when --num-models is high,
                        since more of the solution space is searched. It
                        should be kept in mind that the largest determinant of
                        a model's accuracy is the set of protein templates
                        used, so there is no need to go overboard with an
                        excessively large --num-models. The default is 1.
  --deviation DEVIATION, -d DEVIATION
                        Deviation (angstroms) (default: 4.0)
  --modeller-database MODELLER_DATABASE, -D MODELLER_DATABASE
                        Which database do you want to search the structures
                        of? Default is "pdb_95". If you have your own database
                        it must have either the extension .bin or .pir. If you
                        don't have a database or don't know what this means,
                        don't worry, we will both inform you and take care of
                        you.
  --scoring-method SCORING_METHOD, -b SCORING_METHOD
                        How should the best model be decided? The metric used
                        could be any of GA341_score, DOPE_score, and molpdf.
                        GA341 is an absolute measure, where a good model will
                        have a score near 1.0, whereas anything below 0.6 can
                        be considered bad. DOPE and molpdf scores are relative
                        energy measures, where lower scores are better. DOPE
                        has been generally shown to be a better distinguisher
                        between good and bad models than molpdf. The default
                        is DOPE_score. To learn more see the MODELLER
                        tutorial:
                        https://salilab.org/modeller/tutorial/basic.html.
  --very-fast           If provided, a very fast optimization is done for each
                        model at the cost of accuracy. It is recommended to
                        use a --num-models of 1, since the optimization is so
                        crude that all models will likely converge to the same
                        solution. (default: False)
  --percent-cutoff PERCENT_CUTOFF, -p PERCENT_CUTOFF
                        If a protein in the database has a percent identity to
                        the gene of interest that is less than this parameter,
                        then it is not considered as a template. The default
                        is 30.000000.
  --alignment-fraction-cutoff ALIGNMENT_FRACTION_CUTOFF, -a ALIGNMENT_FRACTION_CUTOFF
                        If a protein in the database aligns to a fraction of
                        the gene of interest that is less than this parameter,
                        the template is not considered. For example, if
                        --alignment-fraction-cutoff is set to 0.90, and the
                        fraction of the gene of interest that is covered by a
                        potential template is 0.80 in their alignment, the
                        template does not align to enough of the gene of
                        interest to be considered. The default is 0.800000.
  --max-number-templates MAX_NUMBER_TEMPLATES, -t MAX_NUMBER_TEMPLATES
                        Generally speaking it is best to use as many templates
                        as possible given that they have high proper percent
                        identity to the gene of interest. Taken from
                        https://salilab.org/modeller/methenz/andras/node4.html:
                        'The use of several templates generally increases the
                        model accuracy. One strength of MODELLER is that it can
                        combine information from multiple template structures,
                        in two ways. First, multiple template structures may
                        be aligned with different domains of the target, with
                        little overlap between them, in which case the
                        modeling procedure can construct a homology-based
                        model of the whole target sequence. Second, the
                        template structures may be aligned with the same part
                        of the target, in which case the modeling procedure is
                        likely to automatically build the model on the locally
                        best template [43,44]. In general, it is frequently
                        beneficial to include in the modeling process all the
                        templates that differ substantially from each other,
                        if they share approximately the same overall
                        similarity to the target sequence.' The default is 5.
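
The selection rules described by the MODELLER parameters above can be sketched in Python. This is not anvi'o's or MODELLER's implementation. The function names, the dictionary keys, and all numbers are hypothetical, and ranking the surviving templates by percent identity before applying --max-number-templates is an assumption. What does come from the help text is the direction of each cutoff and of each scoring metric.

```python
# Direction of each metric as described in the help text:
# GA341 is better when higher; DOPE and molpdf are better when lower.
HIGHER_IS_BETTER = {"GA341_score": True, "DOPE_score": False, "molpdf": False}


def filter_templates(candidates, percent_cutoff=30.0,
                     alignment_fraction_cutoff=0.80, max_number_templates=5):
    """Drop candidates below either cutoff, then keep at most N of them.

    Assumption: survivors are ranked by percent identity (highest first).
    """
    passing = [
        c for c in candidates
        if c["percent_identity"] >= percent_cutoff
        and c["alignment_fraction"] >= alignment_fraction_cutoff
    ]
    passing.sort(key=lambda c: c["percent_identity"], reverse=True)
    return passing[:max_number_templates]


def pick_best_model(models, scoring_method="DOPE_score"):
    """Return the model with the best score under the chosen metric."""
    if scoring_method not in HIGHER_IS_BETTER:
        raise ValueError(f"Unknown scoring method: {scoring_method}")
    chooser = max if HIGHER_IS_BETTER[scoring_method] else min
    return chooser(models, key=lambda m: m[scoring_method])


# Made-up candidate templates and model scores, for illustration only.
candidates = [
    {"id": "t1", "percent_identity": 45.0, "alignment_fraction": 0.80},
    {"id": "t2", "percent_identity": 29.9, "alignment_fraction": 0.95},
    {"id": "t3", "percent_identity": 62.0, "alignment_fraction": 0.91},
]
models = [
    {"name": "model_1", "GA341_score": 0.95, "DOPE_score": -41250.0, "molpdf": 1830.2},
    {"name": "model_2", "GA341_score": 0.99, "DOPE_score": -40980.5, "molpdf": 1795.7},
    {"name": "model_3", "GA341_score": 0.58, "DOPE_score": -39010.3, "molpdf": 2104.9},
]

kept = filter_templates(candidates)               # t2 fails the identity cutoff
best = pick_best_model(models, "DOPE_score")      # lowest DOPE score wins
```

With an alignment fraction cutoff of 0.90, template t1 (fraction 0.80) would be rejected as well, which mirrors the example given for --alignment-fraction-cutoff.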

EXTRA: Everything else.

  --skip-DSSP           Dictionary of Secondary Structure of Proteins (DSSP)
                        is a program that takes as its input a protein
                        structure file and outputs predicted secondary
                        structure (alpha helix, beta strand, etc.), measures
                        of solvent accessibility, and hydrogen bonds for each
                        residue in the protein. If for some reason you don't
                        want this, provide this flag. (default: False)
  --modeller-executable MODELLER_EXECUTABLE
                        The MODELLER program to use. For example, `mod9.19`.
                        Anvi'o will try and find it if not provided (default:
                        None)
  --offline-mode        Anvi'o first tries to obtain template structures from
                        a database (see --pdb-db for details). If the
                        requested template does not exist in the database, its
                        structure will be downloaded from the RCSB PDB server.
                        However, if you don't have access to internet, or hate
                        the RCSB PDB, provide this flag so that all operations
                        of this program remain offline. If the template
                        structure is not in the database, then no template
                        structure for you. (default: False)
  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on an SGE
                        cluster: if you are clusterizing your runs and asking
                        for multiple threads, you may deplete your resources
                        very fast. (default: 1)
  --write-buffer-size-per-thread INT
                        How many items should be kept in memory before they
                        are written to the disk. The default is 25 per thread.
                        So a single-threaded job would have a write buffer
                        size of 25, whereas a job with 4 threads would have a
                        write buffer size of 4*25. The larger the buffer size,
                        the less frequently the program will access the disk,
                        yet the more memory will be consumed since the
                        processed items will be cleared off the memory only
                        after they are written to the disk. The default buffer
                        size will likely work for most cases. Please keep an
                        eye on the memory usage output to make sure the memory
                        use never exceeds the size of the physical memory. If
                        --num-threads is 1, this parameter is ignored because
                        the DB is written to after each gene.

anvi-gen-variability-matrix

Generate a variability matrix (potentially outdated program)

Usage

usage: anvi-gen-variability-matrix [-h] -c CONTIGS_DB --splits-of-interest
                                   FILE [--samples-of-interest FILE]
                                   [--num-positions-from-each-split INT]
                                   [-m INT] [-r RATIO] [-o FILE_PATH]
                                   SUMMARY_DICT

Parameters

positional arguments:

  SUMMARY_DICT          Summary file

optional arguments:

  -h, --help            show this help message and exit
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  --splits-of-interest FILE
                        A file with split names. There should be only one
                        column in the file, and each line should correspond to
                        a unique split name. (default: None)
  --samples-of-interest FILE
                        A file with sample names. There should be only one
                        column in the file, and each line should correspond to
                        a unique sample name (without a column header).
                        (default: None)
  --num-positions-from-each-split INT
                        Each split may have one or more variable positions. By
                        default, anvi'o will report every SNV position found
                        in a given split. This parameter will help you to
                        define a cutoff for the maximum number of SNVs to be
                        reported from a split (if the number of SNVs is more
                        than the number you declare using this parameter, the
                        positions will be randomly subsampled). (default: 0)
  -m INT, --min-scatter INT
                        This one is tricky. If you have N samples in your
                        dataset, a given variable position x in one of your
                        splits can split your N samples into `t` groups based
                        on the identity of the variation they harbor at
                        position x. For instance, `t` would have been 1, if
                        all samples had the same type of variation at position
                        x (which would not be very interesting, because in
                        this case position x would have zero contribution to a
                        deeper understanding of how these samples differ based
                        on variability). When `t` > 1, it would mean that
                        identities at position x across samples do differ. But
                        how much scattering occurs based on position x when t
                        > 1? If t=2, how many samples ended in each group?
                        Obviously, even distribution of samples across groups
                        may tell us something different than uneven
                        distribution of samples across groups. So, this
                        parameter filters out any x if 'the number of samples
                        in the second largest group' (=scatter) is less than
                        -m. Here is an example: let's assume you have 7
                        samples. While 5 of those have AG, 2 of them have TC
                        at position x. This would mean scatter of x is 2. If
                        you set -m to 3, this position would not be reported
                        in your output matrix. The default value for -m is 0,
                        which means every `x` found in the database that
                        survived previous filtering criteria will be reported.
                        Naturally, -m cannot be more than half of the number
                        of samples. Please refer to the user documentation if
                        this is confusing.
  -r RATIO, --min-ratio-of-competings-nts RATIO
                        Minimum ratio of the competing nucleotides at a given
                        position. Default is 0.
  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: variability.txt)
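
The scatter value that -m / --min-scatter filters on can be sketched in Python. This is an illustration of the rule as worded in the help text (scatter is the number of samples in the second largest group, and a position is filtered out when its scatter is less than -m), not anvi'o's actual code. The function names are hypothetical, and returning 0 when all samples fall into one group is an assumption.

```python
from collections import Counter


def scatter(identities):
    """Number of samples in the second largest group at one position.

    Assumption: returns 0 when every sample shares the same identity.
    """
    group_sizes = sorted(Counter(identities).values(), reverse=True)
    return group_sizes[1] if len(group_sizes) > 1 else 0


def keep_position(identities, min_scatter=0):
    """A position is filtered out when its scatter is less than min_scatter."""
    return scatter(identities) >= min_scatter


# Seven hypothetical samples: five carry AG and two carry TC at position x.
identities = ["AG"] * 5 + ["TC"] * 2
print(scatter(identities))                        # 2
print(keep_position(identities, min_scatter=3))   # False
```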

anvi-gen-variability-network

A program to generate a network description from an anvi'o variability profile (potentially outdated program)

Example uses and other resources

Usage

usage: anvi-gen-variability-network [-h] -i VARIABILITY_PROFILE
                                    [-n NUM_POSITIONS] [-o FILE_PATH]

Parameters

optional arguments:

  -i VARIABILITY_PROFILE, --input-file VARIABILITY_PROFILE
                        The anvi'o variability profile. Please see `anvi-gen-
                        variability-profile` to generate one. (default: None)
  -n NUM_POSITIONS, --max-num-unique-positions NUM_POSITIONS
                        Maximum number of unique positions to be used in the
                        network. This may be one way to avoid extremely large
                        network descriptions that would defeat the purpose of
                        a quick visualization. If there are more unique
                        positions in the variability profile, the program will
                        randomly select a subset of them to match the `max-
                        num-unique-positions`. The default is 0, which means
                        all positions should be reported. Remember that the
                        number of nodes in the network will also depend on the
                        number of samples described in the variability
                        profile.
  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: network.gexf)

anvi-gen-variability-profile

Generate a table that comprehensively summarizes the variability of nucleotide, codon, or amino acid positions. We call these single nucleotide variants (SNVs), single codon variants (SCVs), and single amino acid variants (SAAVs), respectively

Example uses and other resources

Usage

usage: anvi-gen-variability-profile [-h] -p PROFILE_DB -c CONTIGS_DB
                                    [-s STRUCTURE_DB] [-C COLLECTION_NAME]
                                    [-b BIN_NAME] [--splits-of-interest FILE]
                                    [--genes-of-interest FILE]
                                    [--gene-caller-ids GENE_CALLER_IDS]
                                    [--only-if-structure]
                                    [--samples-of-interest FILE]
                                    [--engine ENGINE]
                                    [--num-positions-from-each-split INT]
                                    [-r FLOAT] [-z FLOAT] [-j FLOAT]
                                    [-a FLOAT] [-x NUM_SAMPLES]
                                    [--min-coverage-in-each-sample INT]
                                    [--quince-mode] [--kiefl-mode]
                                    [-o VARIABILITY_PROFILE]
                                    [--include-contig-names]
                                    [--include-split-names]
                                    [--include-additional-data]
                                    [--include-site-pnps]
                                    [--compute-gene-coverage-stats]

Parameters

DATABASES: Declaring relevant anvi'o databases. First things first. Some are mandatory, some are optional.

  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -s STRUCTURE_DB, --structure-db STRUCTURE_DB
                        Anvi'o structure database. (default: None)

FOCUS :: BIN: You need to pick something to focus on. You can ask anvi'o to work with a bin in a collection.

  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name. (default: None)
  -b BIN_NAME, --bin-id BIN_NAME
                        Bin name you are interested in. (default: None)

FOCUS :: SPLIT NAMES: Alternatively you can declare split names to focus on.

  --splits-of-interest FILE
                        A file with split names. There should be only one
                        column in the file, and each line should correspond to
                        a unique split name. (default: None)

FOCUS :: GENE CALLER IDs: Alternatively you can declare gene caller IDs to focus on.

  --genes-of-interest FILE
                        A file with anvi'o gene caller IDs. There should be
                        only one column in the file, and each line should
                        correspond to a unique gene caller id (without a
                        column header). (default: None)
  --gene-caller-ids GENE_CALLER_IDS
                        Gene caller ids. Multiple of them can be declared
                        separated by a delimiter (the default is a comma). In
                        anvi-gen-variability-profile, if you declare nothing
                        you will get all genes matching your other filtering
                        criteria. In other programs, you may get everything,
                        nothing, or an error. It really depends on the
                        situation. Fortunately, mistakes are cheap, so it's
                        worth a try. (default: None)
  --only-if-structure   If provided, your genes of interest will be further
                        subset to only include genes with structures in your
                        structure database, and therefore must be supplied in
                        conjunction with a structure database, i.e. `-s
                        <your_structure_database>`. If you did not specify
                        genes of interest, ALL genes will be subset to those
                        that have structures. (default: False)

SAMPLES: You can ask anvi'o to focus only on a subset of samples.

  --samples-of-interest FILE
                        A file with sample names. There should be only one
                        column in the file, and each line should correspond to
                        a unique sample name (without a column header).
                        (default: None)

ENGINE: Set your engine. This is important as it will define the output profile you will get from this program. The engine can focus on nucleotides (NT), codons (CDN), or amino acids (AA).

  --engine ENGINE       Variability engine. The default is 'NT'.

FILTERS: Parameters that will help you to do a very precise analysis. If you declare nothing from this bunch, you will get "everything" to play with, which is not necessarily a good thing…

  --num-positions-from-each-split INT
                        Each split may have one or more variable positions. By
                        default, anvi'o will report every SNV position found
                        in a given split. This parameter will help you to
                        define a cutoff for the maximum number of SNVs to be
                        reported from a split (if the number of SNVs is more
                        than the number you declare using this parameter, the
                        positions will be randomly subsampled). (default: 0)
  -r FLOAT, --min-departure-from-reference FLOAT
                        Takes a value between 0 and 1, where 1 is maximum
                        divergence from the reference. Default is 0.000000.
                        The reference here is the observation that corresponds
                        to a given position in the mapped context.
  -z FLOAT, --max-departure-from-reference FLOAT
                        Similar to '--min-departure-from-reference', but
                        defines an upper limit for divergence. The default is
                        1.000000.
  -j FLOAT, --min-departure-from-consensus FLOAT
                        Takes a value between 0 and 1, where 1 is maximum
                        divergence from the consensus for a given position.
                        The default is 0.000000. The consensus is the most
                        frequent observation at a given position.
  -a FLOAT, --max-departure-from-consensus FLOAT
                        Similar to '--min-departure-from-consensus', but
                        defines an upper limit for divergence. The default is
                        1.000000.
  -x NUM_SAMPLES, --min-occurrence NUM_SAMPLES
                        Minimum number of samples in which a nucleotide
                        position should be reported as variable. Default is 1.
                        If you set it to 2, for instance, each eligible
                        variable position will be expected to appear in at
                        least two samples, which will reduce the impact of
                        stochastic, or unintelligible variable positions.
  --min-coverage-in-each-sample INT
                        Minimum coverage of a given variable nucleotide
                        position in all samples. If a nucleotide position is
                        covered less than this value even in one sample, it
                        will be removed from the analysis. Default is 0.
  --quince-mode         The default behavior is to report allele frequencies
                        only at positions where variation was reported during
                        profiling (which by default uses some heuristics to
                        minimize the impact of error-driven variation). So, if
                        there are 10 samples, and a given position has been
                        reported as a variable site during profiling in only
                        one of those samples, no information will be stored in
                        the database for the remaining 9.
                        When this flag is used, we go back to each sample, and
                        report allele frequencies for each sample at this
                        position, even if they do not vary. It will take
                        considerably longer to report when this flag is on,
                        and the use of it will increase the file size
                        dramatically, but it is necessary for some
                        statistical approaches and visualizations. (default:
                        False)
  --kiefl-mode          The default behavior is to report codon/amino-acid
                        frequencies only at positions where variation was
                        reported during profiling (which by default uses some
                        heuristics to minimize the impact of error-driven
                        variation). When this flag is used, all positions are
                        reported, regardless of whether they contained
                        variation in any sample. The reference codon for all
                        such entries is given a codon frequency of 1. All
                        other entries (aka those with legitimate variation to
                        be reported) remain unchanged. This flag can only be
                        used with `--engine AA` or `--engine CDN` and is
                        incompatible with --quince-mode. (default: False)
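
The two departure measures used by the filters above can be illustrated with a short Python sketch. It assumes that departure is computed as one minus the frequency of the reference (or consensus) observation at a position. The help text only states that the values range from 0 to 1 and that the consensus is the most frequent observation, so the exact formula, the function name, and the counts below are assumptions for illustration.

```python
def departures(counts, reference):
    """Return (departure_from_reference, departure_from_consensus).

    counts maps each observation (e.g. a nucleotide) to the number of reads
    supporting it at one position. Assumed formula: 1 - frequency.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no coverage at this position")
    consensus = max(counts, key=counts.get)  # most frequent observation
    from_reference = 1 - counts.get(reference, 0) / total
    from_consensus = 1 - counts[consensus] / total
    return from_reference, from_consensus


# Hypothetical position: reference is A, but most reads carry G.
from_ref, from_cons = departures({"A": 10, "G": 30}, reference="A")
print(from_ref, from_cons)  # 0.75 0.25
```

Under these assumptions, a filter such as -r 0.5 would keep this position (0.75 is not below 0.5), whereas -j 0.5 would drop it (0.25 is below 0.5).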

OUTPUT: Output file and style

  -o VARIABILITY_PROFILE, --output-file VARIABILITY_PROFILE
                        File path to store results. (default: variability.txt)
  --include-contig-names
                        Use this flag if you would like contig names for each
                        variable position to be included in the output file as
                        a column. By default, we do not include contig names
                        since they can practically double the output file size
                        without any actual benefit in most cases. (default:
                        False)
  --include-split-names
                        Use this flag if you would like split names for each
                        variable position to be included in the output file as
                        a column. (default: False)
  --include-additional-data
                        Use this flag if you would like to append data stored
                        in the `amino_acid_additional_data` table as
                        additional columns to your output. NOTE: This is not
                        yet implemented for the `nucleotide_additional_data`
                        table. (default: False)
  --include-site-pnps   Use this flag if you want per-site pN and pS added as
                        additional columns. Synonymity will be calculated with
                        respect to the reference, to the consensus, and to the
                        most common consensus seen at that site across samples
                        (popular consensus). The number of synonymous and
                        nonsynonymous sites will also be stored for each case.
                        This makes a total of 12 added columns. This flag will
                        be ignored if --engine is not CDN. (default: False)
  --compute-gene-coverage-stats
                        If provided, gene coverage statistics will be appended
                        for each entry in the variability report. This is very
                        useful information, but will not be included by
                        default because it is an expensive operation, and may
                        take some additional time. (default: False)

anvi-get-aa-counts

Fetches the number of times each amino acid occurs from a contigs database in a given bin, set of contigs, or set of genes

Usage

usage: anvi-get-aa-counts [-h] -c CONTIGS_DB [-o FILE_PATH] [-p PROFILE_DB]
                          [-C COLLECTION_NAME] [-B FILE_PATH]
                          [--contigs-of-interest FILE]
                          [--gene-caller-ids GENE_CALLER_IDS]

Parameters

MANDATORY STUFF: You have to set the following two parameters, then you will select one set of parameters from the following optional sections. If you select nothing from those sets, AA counts for everything in the contigs database will be reported.

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)

OPTIONAL PARAMS FOR BINS:

  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name. (default: None)
  -B FILE_PATH, --bin-ids-file FILE_PATH
                        Text file for bins (each line should be a unique bin
                        id). (default: None)

OPTIONAL PARAMS FOR CONTIGS:

  --contigs-of-interest FILE
                        A file with contig names. There should be only one
                        column in the file, and each line should correspond to
                        a unique contig name. (default: None)

OPTIONAL PARAMS FOR GENE CALLS:

  --gene-caller-ids GENE_CALLER_IDS
                        Gene caller ids. Multiple of them can be declared
                        separated by a delimiter (the default is a comma). In
                        anvi-gen-variability-profile, if you declare nothing
                        you will get all genes matching your other filtering
                        criteria. In other programs, you may get everything,
                        nothing, or an error. It really depends on the
                        situation. Fortunately, mistakes are cheap, so it's
                        worth a try. (default: None)

anvi-get-codon-frequencies

Get amino acid or codon frequencies of genes in a contigs database

Usage

usage: anvi-get-codon-frequencies [-h] -c CONTIGS_DB
                                  [--gene-caller-id GENE_CALLER_ID]
                                  [--collapse-genes]
                                  [--return-AA-frequencies-instead] -o
                                  FILE_PATH [--percent-normalize]
                                  [--merens-codon-normalization]

Parameters

INPUT DATABASE: The contigs database. Clearly those genes must be read from somewhere.

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)

OPTIONALS: Important things to read never end. Stupid science.

  --gene-caller-id GENE_CALLER_ID
                        OK. You can declare a single gene caller ID if you
                        wish, in which case anvi'o would only return results
                        for a single gene call. If you don't declare anything,
                        well, you must be prepared to brace yourself if you
                        are working with a very large contigs database with
                        hundreds of thousands of genes. (default: None)
  --collapse-genes      By default, codon frequencies are reported on a per-
                        gene basis, meaning that a frequency is reported for
                        each gene-codon pairing. If you provide this flag,
                        codon frequencies will instead be collapsed across
                        genes, such that a single frequency is reported for
                        each codon. (default: False)
  --return-AA-frequencies-instead
                        By default, anvi'o will return codon frequencies (as
                        the name suggests), but you can ask for amino acid
                        frequencies instead, simply because you always need
                        more data and more stuff. You're lucky this time, but
                        is there an end to this? Will you ever be satisfied
                        with what you have? Anvi'o needs answers. (default:
                        False)
  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
  --percent-normalize   Instead of actual counts, report percent-normalized
                        frequencies per gene (because you are too lazy to do
                        things the proper way in R). (default: False)
  --merens-codon-normalization
                        This is a flag to percent normalize codon frequencies
                        within those that encode for the same amino acid. It
                        is different from the flag --percent-normalize, since
                        it does not percent normalize frequencies of codons
                        within a gene based on all codon frequencies. Clearly
                        this flag is not applicable if you wish to work with
                        boring amino acids. WHO WORKS WITH AMINO ACIDS
                        ANYWAYS. (default: False)
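
The difference between --percent-normalize and --merens-codon-normalization can be sketched in Python. This is not anvi'o's implementation: the function names are hypothetical, the counts are made up, the codon table is deliberately limited to four codons, and reporting values on a 0 to 100 scale is an assumption.

```python
def percent_normalize(codon_counts):
    """Percent of each codon relative to all codons in the gene."""
    total = sum(codon_counts.values())
    return {c: (100 * n / total if total else 0.0) for c, n in codon_counts.items()}


def normalize_within_amino_acid(codon_counts, codon_to_aa):
    """Percent of each codon relative to codons encoding the same amino acid."""
    aa_totals = {}
    for codon, n in codon_counts.items():
        aa = codon_to_aa[codon]
        aa_totals[aa] = aa_totals.get(aa, 0) + n
    return {
        c: (100 * n / aa_totals[codon_to_aa[c]] if aa_totals[codon_to_aa[c]] else 0.0)
        for c, n in codon_counts.items()
    }


# Partial codon table (standard genetic code) and made-up counts for one gene.
codon_to_aa = {"GCT": "Ala", "GCC": "Ala", "AAA": "Lys", "AAG": "Lys"}
counts = {"GCT": 6, "GCC": 2, "AAA": 1, "AAG": 1}

per_gene = percent_normalize(counts)
per_amino_acid = normalize_within_amino_acid(counts, codon_to_aa)
print(per_gene)        # {'GCT': 60.0, 'GCC': 20.0, 'AAA': 10.0, 'AAG': 10.0}
print(per_amino_acid)  # {'GCT': 75.0, 'GCC': 25.0, 'AAA': 50.0, 'AAG': 50.0}
```

In the per-gene view the two lysine codons look rare, while the within-amino-acid view shows that they are used equally often, which is the kind of distinction the second flag is meant to expose.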

anvi-get-pn-ps-ratio

FIXME

Usage

usage: anvi-get-pn-ps-ratio [-h] [-V SCV_FILE] -c CONTIGS_DB [-j FLOAT]
                            [-r FLOAT] [-i MINIMUM_NUM_VARIANTS]
                            [-m MIN_COVERAGE]
                            [-x {reference,consensus,popular_consensus}] -o
                            DIR_PATH [-p]

Parameters

VARIABILITY: Provide an SCV table that can be generated with anvi-gen-variability-profile.

  -V SCV_FILE, --variability-profile SCV_FILE
                        Filepath to the SCV table. (default: None)
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Filepath to the contigs database used to generate
                        variability table. (default: None)

TUNABLES: Successfully tune one or more of these parameters to unlock the badge 'Advanced anvian'.

  -j FLOAT, --min-departure-from-consensus FLOAT
                        SCVs will be ignored if they have a departure from
                        consensus less than this value. Note: Keep in mind you
                        may have already supplied this parameter during anvi-
                        gen-variability-profile. The default value is 0.00.
  -r FLOAT, --min-departure-from-reference FLOAT
                        SCVs will be ignored if they have a departure from
                        reference less than this value. Note: Keep in mind you
                        may have already supplied this parameter during anvi-
                        gen-variability-profile. The default value is 0.00.
  -i MINIMUM_NUM_VARIANTS, --minimum-num-variants MINIMUM_NUM_VARIANTS
                        Ignore groups with fewer than this number of single
                        codon variants. pN, pS, and pN/pS values will be set
                        to NaN for these groups. (default: 4)
  -m MIN_COVERAGE, --min-coverage MIN_COVERAGE
                        If the coverage value at a codon is less than this
                        amount, any associated SCVs will be ignored. The
                        default is 30.
  -x {reference,consensus,popular_consensus}, --comparison {reference,consensus,popular_consensus}
                        You can determine synonymity relative to either the
                        reference codon, or the consensus codon. The consensus
                        codon is determined on a per-sample basis. The default
                        is 'reference.'
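
How the tunables above interact can be sketched in a few lines. This is a rough illustration under stated assumptions, not anvi'o's implementation: the dictionary keys `coverage` and `departure_from_reference`, the function names, and the example values are all hypothetical, and a "group" is represented simply as a list of SCVs.

```python
def filter_scvs(scvs, min_coverage=30, min_departure_from_reference=0.0):
    """Keep SCVs that pass the coverage and departure thresholds."""
    return [
        scv for scv in scvs
        if scv["coverage"] >= min_coverage
        and scv["departure_from_reference"] >= min_departure_from_reference
    ]


def has_enough_variants(scvs, minimum_num_variants=4):
    """Groups below this size would get NaN for pN, pS, and pN/pS."""
    return len(scvs) >= minimum_num_variants


# Made-up SCVs for a single group.
scvs = [
    {"coverage": 45, "departure_from_reference": 0.20},
    {"coverage": 12, "departure_from_reference": 0.50},  # dropped: low coverage
    {"coverage": 80, "departure_from_reference": 0.05},  # dropped: low departure
]

kept = filter_scvs(scvs, min_departure_from_reference=0.1)
print(len(kept), has_enough_variants(kept))
```

Only one SCV survives both thresholds here, so with the default minimum of 4 variants this group would not be reportable.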

OUTPUT: The output of this program is a directory with several tables.

  -o DIR_PATH, --output-dir DIR_PATH
                        Directory path for output files (default: None)
  -p, --pivot           By default the output is in long format, however you
                        can choose the output to be in matrix form with this
                        flag. If you're not sure which one is right for you,
                        just try one and take a look at the output--there is
                        no cost for making a mistake :) (default: False)
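
For readers unsure what "long format" versus "matrix form" means, here is a small sketch of the reshaping. The row layout (sample, gene id, value), the values, and the `pivot` function are assumptions made for illustration; the actual column names in the output tables are not shown in this help text.

```python
# Hypothetical long-format rows: (sample, gene id, pN/pS value).
long_rows = [
    ("sample_01", 101, 0.12),
    ("sample_01", 102, 0.40),
    ("sample_02", 101, 0.15),
]


def pivot(rows):
    """Turn long-format rows into a gene-by-sample matrix (dict of dicts)."""
    samples = sorted({sample for sample, _, _ in rows})
    matrix = {}
    for sample, gene, value in rows:
        # Combinations that never appear in the long table stay NaN.
        row = matrix.setdefault(gene, {s: float("nan") for s in samples})
        row[sample] = value
    return samples, matrix


samples, matrix = pivot(long_rows)
print("gene", *samples, sep="\t")
for gene in sorted(matrix):
    print(gene, *(matrix[gene][s] for s in samples), sep="\t")
```

The long table has one row per observation, while the matrix has one row per gene and one column per sample.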

anvi-get-sequences-for-gene-calls

A script to get back sequences for gene calls

Example uses and other resources

Usage

usage: anvi-get-sequences-for-gene-calls [-h] [-c CONTIGS_DB]
                                         [--gene-caller-ids GENE_CALLER_IDS]
                                         [--flank-length INT]
                                         [--delimiter CHAR]
                                         [--report-extended-deflines]
                                         [--wrap WRAP] [--export-gff3]
                                         [--get-aa-sequences]
                                         [--external-gene-calls GENE-CALLS]
                                         [-g GENOMES_STORAGE]
                                         [-G GENOME_NAMES] -o FILE_PATH

Parameters

OPTION #1: EXPORT FROM CONTIGS DB:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  --gene-caller-ids GENE_CALLER_IDS
                        Gene caller ids. Multiple of them can be declared
                        separated by a delimiter (the default is a comma). In
                        anvi-gen-variability-profile, if you declare nothing
                        you will get all genes matching your other filtering
                        criteria. In other programs, you may get everything,
                        nothing, or an error. It really depends on the
                        situation. Fortunately, mistakes are cheap, so it's
                        worth a try. (default: None)
  --flank-length INT    Extend sequences for gene calls with additional
                        nucleotides from both ends. If the sequence for a
                        target gene is between nucleotide positions START and
                        STOP, using a flank length of M will give you a
                        sequence that starts at START - M and ends at STOP +
                        M. (default: 0)
  --delimiter CHAR      The delimiter to parse multiple input terms. The
                        default is ','.
  --report-extended-deflines
                        When declared, the deflines in the resulting FASTA
                        file will contain more information. (default: False)
  --wrap WRAP           When to wrap sequences when storing them in a FASTA
                        file. The default is '120'. A value of '0' would be
                        equivalent to 'do not wrap'.
  --export-gff3         If this is true, the output file will be in GFF3
                        format. (default: False)
  --get-aa-sequences    Store amino acid sequences instead. (default: False)
  --external-gene-calls GENE-CALLS
                        An optional external gene calls file path that
                        precisely describes the set of gene sequences
                        exported. Using this file you can create an anvi'o
                        contigs database from the resulting genes FASTA file
                        without having to do a gene calling from scratch.
                        (default: None)
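
The arithmetic behind `--flank-length` (START - M to STOP + M) can be sketched as below. This assumes 0-based, half-open coordinates and assumes that flanks are clipped at the contig ends; neither detail is stated in the help text, and the function name is hypothetical.

```python
def extend_with_flanks(contig, start, stop, flank_length=0):
    """Return contig[start - M : stop + M], clipped to the contig ends."""
    new_start = max(0, start - flank_length)
    new_stop = min(len(contig), stop + flank_length)
    return contig[new_start:new_stop]


contig = "AAACCCGGGTTT"

# The gene occupies positions 3..9 (half-open), i.e. "CCCGGG".
print(extend_with_flanks(contig, 3, 9))
print(extend_with_flanks(contig, 3, 9, flank_length=2))
```

With a flank length of 2 the sequence gains two nucleotides on each side; a flank longer than the remaining contig simply returns everything up to the contig boundary in this sketch.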

OPTION #2: EXPORT FROM A GENOMES STORAGE:

  -g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
                        Anvi'o genomes storage file (default: None)
  -G GENOME_NAMES, --genome-names GENOME_NAMES
                        Genome names to 'focus'. You can use this parameter to
                        limit the genomes included in your analysis. You can
                        provide these names as a comma-separated list of
                        names, or you can put them in a file, where you have a
                        single genome name in each line, and provide the file
                        path. (default: None)
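
The two ways of passing genome names (a comma-separated list, or a file with one name per line) could be handled along these lines. This is a sketch of the idea only; the function name is hypothetical and the rule "treat the value as a file if a file with that name exists" is an assumption.

```python
import os


def parse_genome_names(value):
    """Return genome names from a comma-separated string or a one-per-line file."""
    if os.path.isfile(value):
        with open(value) as handle:
            return [line.strip() for line in handle if line.strip()]
    return [name.strip() for name in value.split(",") if name.strip()]


print(parse_genome_names("genome_A,genome_B"))
```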

OPTIONS COMMON TO ALL INPUTS:

  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)

anvi-get-sequences-for-gene-clusters

Do cool stuff with gene clusters in anvi'o pan genomes

Example uses and other resources

Usage

usage: anvi-get-sequences-for-gene-clusters [-h] -p PAN_DB
                                            [-g GENOMES_STORAGE] [-o FASTA]
                                            [--report-DNA-sequences]
                                            [--gene-cluster-id GENE_CLUSTER_ID]
                                            [--gene-cluster-ids-file FILE_PATH]
                                            [-C COLLECTION_NAME] [-b BIN_NAME]
                                            [--min-num-genomes-gene-cluster-occurs INTEGER]
                                            [--max-num-genomes-gene-cluster-occurs INTEGER]
                                            [--min-num-genes-from-each-genome INTEGER]
                                            [--max-num-genes-from-each-genome INTEGER]
                                            [--max-num-gene-clusters-missing-from-genome INTEGER]
                                            [--min-functional-homogeneity-index FLOAT]
                                            [--max-functional-homogeneity-index FLOAT]
                                            [--min-geometric-homogeneity-index FLOAT]
                                            [--max-geometric-homogeneity-index FLOAT]
                                            [--min-combined-homogeneity-index FLOAT]
                                            [--max-combined-homogeneity-index FLOAT]
                                            [--add-into-items-additional-data-table NAME]
                                            [--list-collections] [--list-bins]
                                            [--concatenate-gene-clusters]
                                            [--partition-file FILE_PATH]
                                            [--separator STRING]
                                            [--align-with ALIGNER]
                                            [--list-aligners] [--just-do-it]
                                            [--dry-run]

Parameters

INPUT FILES: Input files from the pangenome analysis.

  -p PAN_DB, --pan-db PAN_DB
                        Anvi'o pan database (default: None)
  -g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
                        Anvi'o genomes storage file (default: None)

OUTPUT: You get to choose an output file name to report things. The default will be an ugly name. So, be explicit.

  -o FASTA, --output-file FASTA
                        File path to store results. (default: None)
  --report-DNA-sequences
                        By default, this program reports amino acid sequences.
                        Use this flag to report DNA sequences instead.
                        (default: False)

SELECTION: Which gene clusters should be reported. You can ask for a single gene cluster, or multiple ones listed in a file, or you can use a collection and bin name to list gene clusters of interest. If you give nothing, this program will export alignments for every single gene cluster found in the pan database (and this is called 'customer service').

  --gene-cluster-id GENE_CLUSTER_ID
                        Gene cluster ID you are interested in. (default: None)
  --gene-cluster-ids-file FILE_PATH
                        Text file for gene clusters (each line should be a
                        unique gene cluster id). (default: None)
  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name. (default: None)
  -b BIN_NAME, --bin-id BIN_NAME
                        Bin name you are interested in. (default: None)

ADVANCED FILTERS: If you are here you must be looking for ways to specify exactly what you want from that database of gene clusters. These filters will be applied to what your previous selections reported.

  --min-num-genomes-gene-cluster-occurs INTEGER
                        This filter will remove gene clusters from your
                        report. Let's assume you have 100 genomes in your pan
                        genome analysis. You can use this parameter if you
                        want to work only with gene clusters that occur in at
                        least X number of genomes. If you say '--min-num-
                        genomes-gene-cluster-occurs 90', each gene cluster in
                        the analysis will be required at least to appear in 90
                        genomes. If a gene cluster occurs in fewer than that
                        number of genomes, it simply will not be reported.
                        This is especially useful for phylogenomic analyses,
                        where you may want to only focus on gene clusters that
                        are prevalent across the set of genomes you wish to
                        analyze. (default: 0)
  --max-num-genomes-gene-cluster-occurs INTEGER
                        This filter will remove gene clusters from your
                        report. Let's assume you have 100 genomes in your pan
                        genome analysis. You can use this parameter if you
                        want to work only with gene clusters that occur in at
                        most X number of genomes. If you say '--max-num-
                        genomes-gene-cluster-occurs 1', you will get gene
                        clusters that are singletons. Combining this parameter
                        with --min-num-genomes-gene-cluster-occurs can give
                        you a very precise way to filter your gene clusters.
                        (default: 9223372036854775807)
  --min-num-genes-from-each-genome INTEGER
                        This filter will remove gene clusters from your
                        report. If you say '--min-num-genes-from-each-genome
                        2', this filter will remove every gene cluster to
                        which every genome in your analysis contributed fewer
                        than 2 genes. This can be useful to find out gene
                        clusters with many genes from many genomes (such as
                        conserved multi-copy genes within a clade). (default:
                        0)
  --max-num-genes-from-each-genome INTEGER
                        This filter will remove gene clusters from your
                        report. If you say '--max-num-genes-from-each-genome
                        1', every gene cluster that has more than one gene
                        from any genome that contributes to it will be removed
                        from your analysis. This could be useful to remove
                        gene clusters with paralogs from your report for
                        appropriate phylogenomic analyses. For instance, using
                        '--max-num-genes-from-each-genome 1' and '--min-num-
                        genomes-gene-cluster-occurs X' where X is the total
                        number of your genomes, would give you the single-copy
                        gene clusters in your pan genome. (default:
                        9223372036854775807)
  --max-num-gene-clusters-missing-from-genome INTEGER
                        This filter will remove genomes from your report. If
                        you have a list of gene cluster names, you can use
                        this parameter to omit any genome from your report if
                        it is missing more than a number of gene clusters you
                        desire. For instance, if you have 100 genomes in your
                        pan genome, and you are interested in working with 5
                        specific gene clusters of your choice, you can use
                        '--max-num-gene-clusters-missing-from-genome 4' to
                        remove the genomes that are missing more than 4 of
                        those 5 gene clusters. This is especially useful for
                        phylogenomic analyses. Parameter 0 will remove any
                        genome that is missing any of the gene clusters.
                        (default: 0)
  --min-functional-homogeneity-index FLOAT
                        This filter will remove gene clusters from your
                        report. If you say '--min-functional-homogeneity-index
                        0.3', every gene cluster with a functional homogeneity
                        index less than 0.3 will be removed from your
                        analysis. This can be useful if you only want to look
                        at gene clusters that are highly conserved in
                        resulting function (default: -1)
  --max-functional-homogeneity-index FLOAT
                        This filter will remove gene clusters from your
                        report. If you say '--max-functional-homogeneity-index
                        0.5', every gene cluster with a functional homogeneity
                        index greater than 0.5 will be removed from your
                        analysis. This can be useful if you only want to look
                        at gene clusters that don't seem to be functionally
                        conserved (default: 1)
  --min-geometric-homogeneity-index FLOAT
                        This filter will remove gene clusters from your
                        report. If you say '--min-geometric-homogeneity-index
                        0.3', every gene cluster with a geometric homogeneity
                        index less than 0.3 will be removed from your
                        analysis. This can be useful if you only want to look
                        at gene clusters that are highly conserved in
                        geometric configuration (default: -1)
  --max-geometric-homogeneity-index FLOAT
                        This filter will remove gene clusters from your
                        report. If you say '--max-geometric-homogeneity-index
                        0.5', every gene cluster with a geometric homogeneity
                        index greater than 0.5 will be removed from your
                        analysis. This can be useful if you only want to look
                        at gene clusters that may not be as conserved as
                        others (default: 1)
  --min-combined-homogeneity-index FLOAT
                        This filter will remove gene clusters from your
                        report. If you say '--min-combined-homogeneity-index
                        0.3', every gene cluster with a combined homogeneity
                        index less than 0.3 will be removed from your
                        analysis. This can be useful if you only want to look
                        at gene clusters that are highly conserved overall
                        (default: -1)
  --max-combined-homogeneity-index FLOAT
                        This filter will remove gene clusters from your
                        report. If you say '--max-combined-homogeneity-index
                        0.5', every gene cluster with a combined homogeneity
                        index greater than 0.5 will be removed from your
                        analysis. This can be useful if you only want to look
                        at gene clusters that may not be as conserved
                        overall as others (default: 1)
  --add-into-items-additional-data-table NAME
                        If you use any of the filters, and would like to add
                        the resulting item names into the items additional
                        data table of your database, you can use this
                        parameter. You will need to give a name for these
                        results to be saved. If the given name is already in
                        the items additional data table, its contents will be
                        replaced with the new one. Then you can run anvi-
                        interactive or anvi-display-pan to 'see' the results
                        of your filters. (default: None)
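
The way the occurrence filters combine (for example, to recover single-copy core gene clusters) can be sketched as follows. The pangenome below is made up, the function and its argument names are hypothetical, and only three of the filters are modeled; this is not how anvi'o implements them.

```python
import sys

# Hypothetical pangenome: gene cluster -> {genome name: number of genes}.
gene_clusters = {
    "GC_0001": {"genome_A": 1, "genome_B": 1, "genome_C": 1},
    "GC_0002": {"genome_A": 2, "genome_B": 1, "genome_C": 1},
    "GC_0003": {"genome_A": 1},
}


def filter_gene_clusters(
    gene_clusters,
    min_num_genomes=0,
    max_num_genomes=sys.maxsize,
    max_num_genes_from_each_genome=sys.maxsize,
):
    """Return the names of gene clusters that pass all three filters."""
    kept = []
    for name, genes_per_genome in gene_clusters.items():
        num_genomes = len(genes_per_genome)
        if not min_num_genomes <= num_genomes <= max_num_genomes:
            continue
        if max(genes_per_genome.values()) > max_num_genes_from_each_genome:
            continue
        kept.append(name)
    return kept


# Single-copy core: present in all 3 genomes, at most 1 gene from each.
print(filter_gene_clusters(gene_clusters, min_num_genomes=3,
                           max_num_genes_from_each_genome=1))

# Singletons: present in at most 1 genome.
print(filter_gene_clusters(gene_clusters, max_num_genomes=1))
```

Here `GC_0002` is excluded from the single-copy core set because one genome contributes two genes to it.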

OTHER STUFF: Yes. Stuff that is not like the ones above.

  --list-collections    Show available collections and exit. (default: False)
  --list-bins           List available bins in a collection and exit.
                        (default: False)

PHYLOGENOMICS: Get separately aligned and concatenated sequences for phylogenomics.

  --concatenate-gene-clusters
                        Concatenate output gene clusters in the same order to
                        create a multi-gene alignment output that is suitable
                        for phylogenomic analyses. (default: False)
  --partition-file FILE_PATH
                        Some commonly used software for phylogenetic analyses
                        (e.g., IQ-TREE, RAxML, etc) allow users to
                        specify/test different substitution models for each
                        gene of a concatenated multiple sequence alignment.
                        For this, they use a special file format called a
                        'partition file', which indicates the site for each
                        gene in the alignment. You can use this parameter to
                        declare an output path for anvi'o to report a NEXUS
                        format partition file in addition to your FASTA output
                        (requested by Massimiliano Molari in #1333). (default:
                        None)
  --separator STRING    Characters to separate things (the default is whatever
                        is most suitable). (default: None)
  --align-with ALIGNER  The multiple sequence alignment program to use when
                        multiple sequence alignment is necessary. To see all
                        available options, use the flag `--list-aligners`.
                        (default: None)
  --list-aligners       Show available software for multiple sequence
                        alignment. (default: False)
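
To make the idea of a partition file concrete, the sketch below computes the site range of each gene cluster in a concatenated alignment and prints them as a NEXUS `sets` block with `charset` lines, a form that IQ-TREE accepts. The cluster names and lengths are made up, the function is hypothetical, separator characters between genes are ignored, and the exact layout anvi'o writes may differ.

```python
def nexus_partitions(alignment_lengths):
    """Build a NEXUS 'sets' block from (name, aligned length) pairs."""
    lines = ["#nexus", "begin sets;"]
    start = 1  # NEXUS character positions are 1-based and inclusive.
    for name, length in alignment_lengths:
        end = start + length - 1
        lines.append(f"    charset {name} = {start}-{end};")
        start = end + 1
    lines.append("end;")
    return "\n".join(lines)


# Hypothetical aligned lengths, in concatenation order.
print(nexus_partitions([("GC_0001", 120), ("GC_0002", 95)]))
```

Each gene cluster starts where the previous one ended, so the ranges tile the concatenated alignment without gaps or overlaps.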

LIFE SAVERS: Just when you need them.

  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)
  --dry-run             Don't do anything real. Test everything, and stop
                        right before wherever the developer said 'well, this
                        is enough testing', and decided to print out results.
                        (default: False)

anvi-get-sequences-for-hmm-hits

Get sequences for HMM hits from many inputs

Example uses and other resources

Usage

usage: anvi-get-sequences-for-hmm-hits [-h] [-c CONTIGS_DB] [-p PROFILE_DB]
                                       [-C COLLECTION_NAME] [-b BIN_NAME]
                                       [-B FILE_PATH] [-e FILE_PATH]
                                       [-i FILE_PATH]
                                       [--hmm-sources SOURCE NAME]
                                       [--gene-names HMM HIT NAME] [-l] [-L]
                                       [-o FILE_PATH] [--no-wrap]
                                       [--get-aa-sequences]
                                       [--concatenate-genes]
                                       [--partition-file FILE_PATH]
                                       [--max-num-genes-missing-from-bin INTEGER]
                                       [--min-num-bins-gene-occurs INTEGER]
                                       [--align-with ALIGNER]
                                       [--separator STRING]
                                       [--return-best-hit] [--unique-genes]
                                       [--just-do-it]

Parameters

INPUT OPTION #1: CONTIGS DB: There are multiple ways to access sequences. Your first option is to provide a contigs database, and call it a day. In this case the program will return you everything from it.

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)

INPUT OPTION #2: CONTIGS DB + PROFILE DB: You can also work with anvi'o profile databases and collections stored in them. If you go this way, you still will need to provide a contigs database. If you just specify a collection name, you will get hits from every bin in it. You can also use the bin name or bin ids file parameters to specify your interest more precisely.

  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name. (default: None)
  -b BIN_NAME, --bin-id BIN_NAME
                        Bin name you are interested in. (default: None)
  -B FILE_PATH, --bin-ids-file FILE_PATH
                        Text file for bins (each line should be a unique bin
                        id). (default: None)

INPUT OPTION #3: INT/EXTERNAL GENOMES FILE: Yes. You can alternatively use as input an internal or external genomes file, or both of them together. If you have multiple contigs databases without any profile database, you can use the external genomes file. So if you just have a bunch of FASTA files and nothing else, this is what you need. In contrast, if you want to access genes in bins described in collections stored in anvi'o profile databases, then you can use the internal genomes file route. Or you can mix the two, because why not. There is not much room for excuses here.

  -e FILE_PATH, --external-genomes FILE_PATH
                        A two-column TAB-delimited flat text file that lists
                        anvi'o contigs databases. The first item in the header
                        line should read 'name', and the second should read
                        'contigs_db_path'. Each line in the file should
                        describe a single entry, where the first column is the
                        name of the genome (or MAG), and the second column is
                        the anvi'o contigs database generated for this genome.
                        (default: None)
  -i FILE_PATH, --internal-genomes FILE_PATH
                        A five-column TAB-delimited flat text file. The header
                        line must contain these columns: 'name', 'bin_id',
                        'collection_id', 'profile_db_path', 'contigs_db_path'.
                        Each line should list a single entry, where 'name' can
                        be any name to describe the anvi'o bin identified as
                        'bin_id' that is stored in a collection. (default:
                        None)
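
Both files are plain TAB-delimited tables with the header columns named above, so they are easy to generate. The sketch below writes them with Python's `csv` module; the genome names, bin and collection names, and database paths are placeholders, and the helper function is hypothetical.

```python
import csv
import io

EXTERNAL_HEADER = ["name", "contigs_db_path"]
INTERNAL_HEADER = ["name", "bin_id", "collection_id",
                   "profile_db_path", "contigs_db_path"]


def write_genomes_file(header, rows, handle):
    """Write a TAB-delimited genomes file with the given header and rows."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


# Placeholder entries; replace with your own names and paths.
external = io.StringIO()
write_genomes_file(EXTERNAL_HEADER, [
    ("genome_A", "genome_A.db"),
    ("genome_B", "genome_B.db"),
], external)

internal = io.StringIO()
write_genomes_file(INTERNAL_HEADER, [
    ("MAG_01", "Bin_1", "default", "PROFILE.db", "CONTIGS.db"),
], internal)

print(external.getvalue())
print(internal.getvalue())
```

Swapping `io.StringIO()` for `open("external-genomes.txt", "w", newline="")` would write the same content to disk.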

HMM STUFF: This is where you can specify an HMM source, and/or a list of genes to filter your results.

  --hmm-sources SOURCE NAME
                        Get sequences for a specific list of HMM sources. You
                        can list one or more sources by separating them from
                        each other with a comma character (i.e., '--hmm-
                        sources source_1,source_2,source_3'). If you would
                        like to see a list of available sources in the contigs
                        database, run this program with '--list-hmm-sources'
                        flag. (default: None)
  --gene-names HMM HIT NAME
                        Get sequences only for a specific gene name. Each name
                        should be separated from each other by a comma
                        character. For instance, if you want to get back only
                        RecA and Ribosomal_L27, you can type '--gene-names
                        RecA,Ribosomal_L27', and you will get any and every
                        hit that matches these names in any source. If you
                        would like to see a list of available gene names, you
                        can use '--list-available-gene-names' flag. (default:
                        None)
  -l, --list-hmm-sources
                        List available HMM sources in the contigs database and
                        quit. (default: False)
  -L, --list-available-gene-names
                        List available gene names in HMM sources selection and
                        quit. (default: False)

THE OUTPUT: Where should the output go. It will be a FASTA file, and you better give it a nice name.

  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
  --no-wrap             Do not wrap sequences nicely in the output file.
                        (default: False)

THE ALPHABET: The sequences are reported in DNA alphabet, but you can also get them translated just like all the other cool kids.

  --get-aa-sequences    Store amino acid sequences instead. (default: False)

PHYLOGENOMICS? K!: If you want, you can get your sequences concatenated. In this case anvi'o will use muscle to align every homolog, and concatenate them in the order you specified using the gene-names argument. Each concatenated sequence will be separated from the other ones by the separator.

  --concatenate-genes   Concatenate output genes in the same order to create a
                        multi-gene alignment output that is suitable for
                        phylogenomic analyses. (default: False)
  --partition-file FILE_PATH
                        Some commonly used software for phylogenetic analyses
                        (e.g., IQ-TREE, RAxML, etc) allow users to
                        specify/test different substitution models for each
                        gene of a concatenated multiple sequence alignment.
                        For this, they use a special file format called a
                        'partition file', which indicates the site for each
                        gene in the alignment. You can use this parameter to
                        declare an output path for anvi'o to report a NEXUS
                        format partition file in addition to your FASTA output
                        (requested by Massimiliano Molari in #1333). (default:
                        None)
  --max-num-genes-missing-from-bin INTEGER
                        This filter removes bins (or genomes) from your
                        analysis. If you have a list of gene names, you can
                        use this parameter to omit any bin (or external
                        genome) that is missing more than a number of genes
                        you desire. For instance, if you have 100 genome bins,
                        and you are interested in working with 5 ribosomal
                        proteins, you can use '--max-num-genes-missing-from-
                        bin 4' to remove the bins that are missing more than 4
                        of those 5 genes. This is especially useful for
                        phylogenomic analyses. Parameter 0 will remove any bin
                        that is missing any of the genes. (default: None)
  --min-num-bins-gene-occurs INTEGER
                        This filter removes genes from your analysis. Let's
                        assume you have 100 bins to get sequences for HMM
                        hits. If you want to work only with genes among all
                        the hits that occur in at least X number of bins, and
                        discard the rest of them, you can use this flag. If
                        you say '--min-num-bins-gene-occurs 90', each gene in
                        the analysis will be required at least to appear in 90
                        genomes. If a gene occurs in less than that number of
                        genomes, it simply will not be reported. This is
                        especially useful for phylogenomic analyses, where you
                        may want to only focus on genes that are prevalent
                        across the set of genomes you wish to analyze.
                        (default: None)
  --align-with ALIGNER  The multiple sequence alignment program to use when
                        multiple sequence alignment is necessary. To see all
                        available options, use the flag `--list-aligners`.
                        (default: None)
  --separator STRING    A word that will be used to separate concatenated gene
                        sequences from each other (IF you are using this
                        program with `--concatenate-genes` flag). The default
                        is "XXX" for amino acid sequences, and "NNN" for DNA
                        sequences (default: None)
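
What concatenation with a separator produces can be sketched as below. The aligned sequences are made up, the function is hypothetical, and filling a missing gene with gap characters is an assumption of this sketch; the help text does not say how anvi'o treats bins that lack a gene.

```python
def concatenate_genes(aligned, gene_names, separator="XXX"):
    """Join per-gene alignments in the given gene order for every bin.

    aligned maps gene name -> {bin name: aligned sequence}.
    """
    bins = sorted({b for gene in gene_names for b in aligned[gene]})
    concatenated = {}
    for bin_name in bins:
        parts = []
        for gene in gene_names:
            # All sequences of one gene share the same aligned length.
            length = len(next(iter(aligned[gene].values())))
            parts.append(aligned[gene].get(bin_name, "-" * length))
        concatenated[bin_name] = separator.join(parts)
    return concatenated


# Made-up amino acid alignments for two genes.
aligned = {
    "RecA": {"bin_1": "MKV-", "bin_2": "MKVL"},
    "Ribosomal_L27": {"bin_1": "AHK"},
}

for name, seq in concatenate_genes(aligned, ["RecA", "Ribosomal_L27"]).items():
    print(name, seq)
```

For DNA sequences the same function would be called with `separator="NNN"`, matching the default described above.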

OPTIONAL: Everything is optional, but some options are more optional than others.

  --return-best-hit     A bin (or genome) may contain more than one hit for a
                        gene name in a given HMM source. For instance, there
                        may be multiple RecA hits in a genome bin from
                        Campbell et al. Using this flag, anvi'o will go
                        through all of the gene names that appear multiple
                        times, and remove all but the one with the lowest
                        e-value. Good for whenever you really need to get only
                        a single copy of single-copy core genes from a genome
                        bin. (default: False)
  --unique-genes        An HMM source may contain multiple models that can hit
                        the same gene in a given bin or genome. Using this
                        flag, you can ask anvi'o to go through all genes,
                        identify those with multiple hits and report only the
                        most significant hit for each unique gene. (default:
                        False)
  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)

anvi-get-short-reads-from-bam

Get short reads back from a BAM file with options for compression, splitting of forward and reverse reads, etc

Usage

usage: anvi-get-short-reads-from-bam [-h] -p PROFILE_DB -c CONTIGS_DB
                                     [-C COLLECTION_NAME] [-b BIN_NAME]
                                     [-B FILE_PATH] [-o FILE_PATH]
                                     [-O FILENAME_PREFIX] [-X] [-Q]
                                     BAM FILE[S] [BAM FILE[S] ...]

Parameters

positional arguments:

  BAM FILE[S]           BAM file(s) to access to recover short reads

optional arguments:

  -h, --help            show this help message and exit
  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name. (default: None)
  -b BIN_NAME, --bin-id BIN_NAME
                        Bin name you are interested in. (default: None)
  -B FILE_PATH, --bin-ids-file FILE_PATH
                        Text file for bins (each line should be a unique bin
                        id). (default: None)
  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
  -O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
                        A prefix to be used while naming the output files (no
                        file type extensions please; just a prefix). (default:
                        None)
  -X, --gzip-output     When declared, output file(s) will be gzip compressed
                        and the extension `.gz` will be added. (default:
                        False)
  -Q, --split-R1-and-R2
                        When declared, this program outputs 3 FASTA files for
                        paired-end reads: one for R1, one for R2, and one for
                        unpaired reads. (default: False)

anvi-get-short-reads-mapping-to-a-gene

Recover short reads from BAM files that were mapped to genes you are interested in. It is possible to work with a single gene call, or a bunch of them. Similarly, you can get short reads from a single BAM file, or from many of them

metagenomics profile_db contigs_db bam variability clustering

Usage

usage: anvi-get-short-reads-mapping-to-a-gene [-h] -c CONTIGS_DB -i
                                              INPUT_BAM(S) [INPUT_BAM(S) ...]
                                              [--gene-caller-id GENE_CALLER_ID]
                                              [--genes-of-interest FILE]
                                              [--leeway LEEWAY_NTs]
                                              [-O FILENAME_PREFIX]

Parameters

INPUT FILES: An anvi'o contigs database and one or more BAM files.

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -i INPUT_BAM(S) [INPUT_BAM(S) ...], --input-files INPUT_BAM(S) [INPUT_BAM(S) ...]
                        Sorted and indexed BAM files to analyze. It is
                        essential that all BAM files must be the result of
                        mappings against the same contigs. (default: None)

GENES: Gene calls you want to work with

  --gene-caller-id GENE_CALLER_ID
                        A single gene id. (default: None)
  --genes-of-interest FILE
                        A file with anvi'o gene caller IDs. There should be
                        only one column in the file, and each line should
                        correspond to a unique gene caller id (without a
                        column header). (default: None)
  --leeway LEEWAY_NTs   The minimum number of nucleotides for a given short
                        read mapping into the gene context for it to be
                        reported. You must consider the length of your short
                        reads, as well as the length of the gene you are
                        targeting. The default is 100 nts.

OUTPUT: How should results be stored.

  -O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
                        A prefix to be used while naming the output files (no
                        file type extensions please; just a prefix). (default:
                        None)
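A minimal sketch of preparing the `--genes-of-interest` file described above: a single column with one unique gene caller ID per line and no header. The gene caller IDs, the file names, and the database name are hypothetical placeholders, and the command is only assembled and printed here, not executed.

```python
import shlex
import tempfile
from pathlib import Path

# Hypothetical gene caller IDs; replace with IDs from your own contigs database.
gene_caller_ids = [4096, 1024, 2048, 1024]

workdir = Path(tempfile.mkdtemp())
genes_file = workdir / "genes-of-interest.txt"

# One unique ID per line, single column, no column header.
unique_ids = sorted(set(gene_caller_ids))
genes_file.write_text("".join(f"{gene_id}\n" for gene_id in unique_ids))

# Assemble the command using only flags listed in the help text above.
# CONTIGS.db and the BAM file names are placeholders.
command = [
    "anvi-get-short-reads-mapping-to-a-gene",
    "-c", "CONTIGS.db",
    "-i", "sample-01.bam", "sample-02.bam",
    "--genes-of-interest", str(genes_file),
    "--leeway", "100",
    "-O", "short-reads",
]
print(shlex.join(command))
```

Passing `--leeway 100` simply restates the documented default; adjust it to the read length and gene length you are working with.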

anvi-get-split-coverages

Export splits and the coverage table from database

Example uses and other resources

Usage

usage: anvi-get-split-coverages [-h] -p PROFILE_DB [-c CONTIGS_DB]
                                [--split-name SPLIT_NAME]
                                [--contig-name CONTIG_NAME]
                                [-C COLLECTION_NAME] [-b BIN_NAME]
                                [--gene-caller-id GENE_CALLER_ID]
                                [--flank-length INT] [-o FILE_PATH]
                                [--list-splits] [--list-collections]
                                [--list-bins]

Parameters

ESSENTIAL ANVI'O DBs: You need to provide a profile database, but whether you will need to provide a contigs database will depend on which input option you want to go with.

  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)

INPUT OPTION #1: SPLIT or CONTIG NAME: You want nothing but the coverage values in a single split .. or a contig. FINE.

  --split-name SPLIT_NAME
                        Split name. (default: None)
  --contig-name CONTIG_NAME
                        Contig name. (default: None)

INPUT OPTION #2: COLLECTION + BIN: You want nucleotide-level coverage values for all splits in a bin. FANCY.

  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name. (default: None)
  -b BIN_NAME, --bin-id BIN_NAME
                        Bin name you are interested in. (default: None)

INPUT OPTION #3: GENE CALL: You want nucleotide-level coverage values for a given gene call. PRO.

  --gene-caller-id GENE_CALLER_ID
                        A single gene id. (default: None)
  --flank-length INT    Extend sequences for gene calls with additional
                        nucleotides from both ends. If the sequence for a
                        target gene is between nucleotide positions START and
                        STOP, using a flank length of M will give you a
                        sequence that starts at START - M and ends at STOP +
                        M. (default: 0)

BORING STUFF: The output file and all.

  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
  --list-splits         When declared, the program will list split names in
                        the profile database and quit. (default: False)
  --list-collections    Show available collections and exit. (default: False)
  --list-bins           List available bins in a collection and exit.
                        (default: False)
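The arithmetic behind `--flank-length` can be sketched as follows. The function name is hypothetical and this is not anvi'o's own code. It only restates the START - M and STOP + M rule from the help text, and it does not clip the result at contig boundaries, since the help text does not say how that case is handled.

```python
from typing import Tuple


def flanked_range(start: int, stop: int, flank_length: int = 0) -> Tuple[int, int]:
    """Return (START - M, STOP + M) for a gene between START and STOP."""
    if flank_length < 0:
        raise ValueError("flank length must be non-negative")
    return start - flank_length, stop + flank_length


# A gene between positions 1500 and 2400 with a flank length of 100.
print(flanked_range(1500, 2400, 100))  # (1400, 2500)
```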

anvi-get-tlen-dist-from-bam

Report the distribution of template lengths from a BAM file. The purpose of this is to get an idea about the insert size distribution in a BAM file rapidly by summarizing distances between each paired-end read in a given read recruitment experiment.

Usage

usage: anvi-get-tlen-dist-from-bam [-h]
                                   [--min-template-length-frequency MIN_TEMPLATE_LENGTH_FREQUENCY]
                                   [--max-template-length-to-consider MAX_TEMPLATE_LENGTH_TO_CONSIDER]
                                   [-o FILE_PATH] [--plot-data]
                                   BAM_FILE

Parameters

positional arguments:

  BAM_FILE              An indexed BAM file

optional arguments:

  -h, --help            show this help message and exit

INPUT OPTIONS: Things you don't care about but should.

  --min-template-length-frequency MIN_TEMPLATE_LENGTH_FREQUENCY
                        How many times a template length should be observed to
                        be considered as a viable template length? If this
                        number is zero, you will have an extremely large
                        number of template lengths that will likely be due to
                        noise. This number can be best set as a function of
                        the total number of mapped reads in a bam file. The
                        default value is set with the assumption that you have
                        millions of reads in the BAM file. (default: 10)
  --max-template-length-to-consider MAX_TEMPLATE_LENGTH_TO_CONSIDER
                        Some paired end reads will map to incredibly distant
                        locations given the linear sequences you have used for
                        read recruitment if the DNA templates were circular in
                        the sample used for sequencing. Which is a beautiful
                        thing, but can ruin your histogram. Using this
                        parameter, you can discard paired end reads that are
                        extremely distant from one another. (default: 500000)

OUTPUT OPTIONS: Output options.

  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
  --plot-data           In addition to providing you with a TAB-delimited
                        output file, anvi'o can also try to plot a summary
                        histogram for the template length distribution across
                        ALL contigs in a given BAM file (default: False)
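To illustrate how the two input options above interact, here is a minimal sketch that filters a list of template lengths. It is an illustration under stated assumptions, not anvi'o's implementation: the function name is hypothetical, the input is assumed to be a plain list of positive template lengths already extracted from a BAM file, and the cutoffs are assumed to be inclusive.

```python
from collections import Counter


def viable_template_lengths(template_lengths, min_frequency=10, max_length=500000):
    """Count template lengths and keep those that pass both cutoffs.

    Assumptions: a length is kept when it is observed at least
    `min_frequency` times and is no longer than `max_length`.
    """
    counts = Counter(template_lengths)
    return {
        length: count
        for length, count in sorted(counts.items())
        if count >= min_frequency and length <= max_length
    }


# Hypothetical data: a common insert size, a rare one, and a very distant pair.
lengths = [300] * 12 + [310] * 3 + [900000] * 20
print(viable_template_lengths(lengths))  # {300: 12}
```

The rare length (310) is dropped by the frequency cutoff, and the very distant pairs (900000) are dropped by the length cutoff even though they are frequent.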

anvi-help

Search for anvi'o programs by keyword, inputs/outputs, etc

Usage

usage: anvi-help [-h] [--requires] [--provides] [--name] [--report REPORT]
                 [search-term]

Parameters

positional arguments:

  search-term           Find programs associated with this search term
                        (optional) (default: ALL)

optional arguments:

  -h, --help            show this help message and exit
  --requires, -r        Restrict to programs that require this search term
                        (default: False)
  --provides, -p        Restrict to programs that provide this search term
                        (default: False)
  --name, -n            Restrict to programs that contain this search term in
                        their name (default: False)
  --report REPORT, -R REPORT
                        Which information would you like to be in the report?
                        Mess with this if you are disappointed with the
                        default. Possibles are Description, Tags, Requires,
                        Provides, Status, and Resources. Add multiple of them
                        with commas (no whitespace). For example, if you
                        wanted Description and Resources, you would put here
                        Description,Resources (default: None)
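The `--report` value is a comma-separated list with no whitespace. A small sketch of building one safely is shown below. The helper name and the search term `bam` are hypothetical, and the command is printed rather than run.

```python
import shlex

# Report fields named in the help text above.
ALLOWED_FIELDS = ("Description", "Tags", "Requires", "Provides", "Status", "Resources")


def build_report_value(fields):
    """Join report fields with commas and no whitespace, rejecting unknown names."""
    unknown = [field for field in fields if field not in ALLOWED_FIELDS]
    if unknown:
        raise ValueError(f"unknown report field(s): {', '.join(unknown)}")
    return ",".join(fields)


report = build_report_value(["Description", "Resources"])

# "bam" is a hypothetical search term.
command = ["anvi-help", "--report", report, "bam"]
print(shlex.join(command))
```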

anvi-import-collection

Import an external binning result into anvi'o

Example uses and other resources

Usage

usage: anvi-import-collection [-h] [-c CONTIGS_DB] [-p PAN_OR_PROFILE_DB] -C
                              COLLECTION_NAME [--bins-info BINS_INFO]
                              [--contigs-mode]
                              TAB DELIMITED FILE

Parameters

positional arguments:

  TAB DELIMITED FILE    The input file that describes bin IDs for each split
                        or contig.

optional arguments:

  -h, --help            show this help message and exit
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
                        Anvi'o pan or profile database (and even genes
                        database in appropriate contexts). (default: None)
  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name. (default: None)
  --bins-info BINS_INFO
                        Additional information for bins. The file must contain
                        three TAB-delimited columns, where the first one must
                        be a unique bin name, the second should be a 'source',
                        and the last one should be a 7 character HTML color
                        code (i.e., '#424242'). Source column must contain
                        information about the origin of the bin. If these bins
                        are automatically identified by a program like
                        CONCOCT, this column could contain the program name
                        and version. The source information will be associated
                        with the bin in various interfaces so in a sense it is
                        not *that* critical what it says there, but on the
                        other hand it is, because we should also think about
                        people who may end up having to work with what we put
                        together later. (default: None)
  --contigs-mode        Use this flag if your binning was done on contigs
                        instead of splits. Please refer to the documentation
                        for help. (default: False)
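A minimal sketch of writing a `--bins-info` file as described above: three TAB-delimited columns holding a unique bin name, a source, and a 7 character HTML color code. The bin names, sources, and file name are hypothetical. No header row is written here, which is an assumption, since the help text does not mention one.

```python
import csv
import re
import tempfile
from pathlib import Path

# '#' followed by six hexadecimal digits, i.e. 7 characters such as '#424242'.
HTML_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")

# Hypothetical bins: (unique bin name, source, HTML color).
bins_info = [
    ("Bin_1", "CONCOCT", "#424242"),
    ("Bin_2", "manual refinement", "#C0392B"),
]

# Check the constraints stated in the help text before writing anything.
names = [name for name, _, _ in bins_info]
if len(names) != len(set(names)):
    raise ValueError("bin names must be unique")
for name, _, color in bins_info:
    if not HTML_COLOR.fullmatch(color):
        raise ValueError(f"{name}: {color!r} is not a 7 character HTML color code")

bins_info_path = Path(tempfile.mkdtemp()) / "bins-info.txt"
with open(bins_info_path, "w", newline="") as handle:
    csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(bins_info)

print(bins_info_path.read_text(), end="")
```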

anvi-import-functions

Parse and store functional annotation of genes

Example uses and other resources

Usage

usage: anvi-import-functions [-h] -c CONTIGS_DB [-p PARSER] -i FILE(S)
                             [FILE(S) ...] [--drop-previous-annotations]

Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -p PARSER, --parser PARSER
                        Parser to make sense of the input files (if you need
                        one). There are currently 2 parsers readily available:
                        ['interproscan', 'AGNOSTOS']. IT IS OK if you do not
                        select a parser if you have a standard, TAB-delimited
                        input file for functional annotation of genes. If this
                        is not like 2018 and everything is already outdated,
                        you should be able to go to this address and learn
                        everything you need like a boss:
                        http://merenlab.org/2016/06/18/importing-functions/
                        (default: None)
  -i FILE(S) [FILE(S) ...], --input-files FILE(S) [FILE(S) ...]
                        One or more input files should follow this parameter.
                        The way these files will be handled will depend on
                        which parser you selected (if you did select any).
                        (default: None)
  --drop-previous-annotations
                        Use this flag if you want anvi'o to remove ALL
                        previous functional annotations for your genes, and
                        then import the new data. The default behavior will
                        add any annotation source into the db incrementally
                        unless there are already annotations from this source.
                        In which case, it will first remove previous
                        annotations for that source only (i.e., if source X is
                        both in the db and in the incoming annotations data,
                        it will replace the content of source X in the db).
                        (default: False)

anvi-import-items-order

Import a new items order into an anvi'o database

Usage

usage: anvi-import-items-order [-h] [-i FILE] [-p DB PATH] [--name ORDER NAME]
                               [--make-default]

Parameters

CRITICAL INPUT: Basically the input file and the target database

  -i FILE, --input-order FILE
                        One of the two important things you must provide: the
                        file that contains the items order. The format of this
                        file is important. It can either contain a proper
                        newick tree in it, or a complete list of 'items' in
                        the target database where every line of the file is
                        simply an item name. If you are providing a newick
                        tree, the entire file should be a single line. I know
                        it sounds hard, but you seriously can do this.
                        (default: None)
  -p DB PATH, --db-path DB PATH
                        An appropriate anvi'o database to import the items
                        order. Currently it can be a profile, pan, or genes
                        database. But you should try your chances with other
                        kinds of databases for fun and games. Basically, if
                        the database contains an items order table, then
                        things will work. Otherwise, you will probably get
                        angry errors back in the worst case scenario.
                        (default: None)

NOT SO CRITICAL INPUT: Because not all parameters are created equal

  --name ORDER NAME     What should we call this order? Give it a concise,
                        single-word name. (default: None)
  --make-default        You have the option to make this order the default
                        order in the database. Which means, anvi'o will use
                        this one when someone runs the program anvi-
                        interactive and presses draw. Big responsibility. But
                        if you have a 'default' state, it will not work
                        because the default items order in the state file
                        overwrites the one that comes from the database. So
                        not that big of a responsibility. (default: False)
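The two accepted shapes of the `--input-order` file (a newick tree on a single line, or one item name per line) can be told apart with a rough check like the one below. This is a heuristic sketch with a hypothetical function name. It does not validate the tree itself, and it is not how anvi'o parses the file.

```python
def items_order_kind(text: str) -> str:
    """Guess whether items order content is a one-line newick tree or a list."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("the items order file is empty")
    # Heuristic: a newick tree starts with '(' and ends with ';' on one line.
    if len(lines) == 1 and lines[0].startswith("(") and lines[0].endswith(";"):
        return "newick"
    if any(line.startswith("(") for line in lines):
        raise ValueError("a newick tree must occupy a single line")
    return "list"


print(items_order_kind("((item_1,item_2),item_3);\n"))  # newick
print(items_order_kind("item_1\nitem_2\nitem_3\n"))      # list
```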

anvi-import-misc-data

Populate additional data or order tables in pan or profile databases for items and layers, OR additional data in contigs databases for nucleotides and amino acids (the Swiss army knife-level serious stuff)

Example uses and other resources

Usage

usage: anvi-import-misc-data [-h] [-p PAN_OR_PROFILE_DB] [-c CONTIGS_DB] -t
                             NAME [-D NAME] [--transpose] [--just-do-it]
                             TAB DELIMITED FILE

Parameters

positional arguments:

  TAB DELIMITED FILE    The input file that describes additional data for
                        layers or items. The expected format of this file
                        depends on the data table you will target. This can
                        feel complicated, but we promise it is not (you
                        probably have a PhD or are working on one, so trust us
                        when we say "it is not complicated"). You need to read
                        the online documentation if this is your first time
                        with this.

optional arguments:

  -h, --help            show this help message and exit

Database input: Provide 1 of these

  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
                        Anvi'o pan or profile database (and even genes
                        database in appropriate contexts). (default: None)
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)

Details: Everything else.

  -t NAME, --target-data-table NAME
                        The target table is the table you are interested in
                        accessing. Currently it can be 'items','layers', or
                        'layer_orders'. Please see most up-to-date online
                        documentation for more information. (default: None)
  -D NAME, --target-data-group NAME
                        Data group to focus. Anvi'o misc data tables support
                        associating a set of data keys with a data group. If
                        you have no idea what this is, then probably you don't
                        need it, and anvi'o will take care of you. Note: this
                        flag is IRRELEVANT if you are working with additional
                        order data tables. (default: None)
  --transpose           Transpose the input matrix file before clustering.
                        (default: False)
  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)

anvi-import-state

Import an anvi'o state into a profile database

Usage

usage: anvi-import-state [-h] -p PAN_OR_PROFILE_DB -s STATE_FILE -n STATE_NAME

Parameters

optional arguments:

  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
                        Anvi'o pan or profile database (and even genes
                        database in appropriate contexts). (default: None)
  -s STATE_FILE, --state STATE_FILE
                        JSON serializable anvi'o state file. (default: None)
  -n STATE_NAME, --name STATE_NAME
                        State name. (default: None)

anvi-import-taxonomy-for-genes

Import gene-level taxonomy into an anvi'o contigs database

Usage

usage: anvi-import-taxonomy-for-genes [-h] -c CONTIGS_DB [-p PARSER] -i FILE(S)
                                      [FILE(S) ...] [--just-do-it]

Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -p PARSER, --parser PARSER
                        Parser to make sense of the input files. There are 3
                        parsers readily available: ['default_matrix',
                        'centrifuge', 'kaiju']. It is OK if you do not select
                        a parser, but in that case there will be no additional
                        contigs available except the identification of single-
                        copy genes in your contigs for later use. Using a
                        parser will not prevent the analysis of single-copy
                        genes, but makes anvi'o more powerful to help you make
                        sense of your results. Please see the documentation,
                        or get in touch with the developers if you have any
                        questions regarding parsers. (default: None)
  -i FILE(S) [FILE(S) ...], --input-files FILE(S) [FILE(S) ...]
                        Input file(s) for selected parser. Each parser (except
                        "blank") requires input files to process that you
                        generate before running anvi'o. Please see the
                        documentation for details. (default: None)
  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)

anvi-import-taxonomy-for-layers

Import layers-level taxonomy into an anvi'o additional layer data table in an anvi'o single-profile database

Usage

usage: anvi-import-taxonomy-for-layers [-h] -p PROFILE_DB [--parser PARSER] -i
                                       FILE(S) [FILE(S) ...]
                                       [--min-abundance PERCENTAGE]

Parameters

optional arguments:

  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  --parser PARSER       Parser to make sense of the input files. There is 1
                        parser readily available: ['krakenuniq']. (default:
                        None)
  -i FILE(S) [FILE(S) ...], --input-files FILE(S) [FILE(S) ...]
                        Input file(s) for selected parser. Each parser (except
                        "blank") requires input files to process that you
                        generate before running anvi'o. Please see the
                        documentation for details. (default: None)
  --min-abundance PERCENTAGE
                        Short read-based taxonomy can be extremely noisy.
                        Therefore, here we have a default minimum percentage
                        cutoff of 0.1% to eliminate any taxon that occurs less
                        than that in a given input file. (default: 0.1)
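The effect of `--min-abundance` can be illustrated with a short sketch. It assumes that abundance means a taxon's percentage of all counts in one input file and that taxa exactly at the cutoff are kept. The function name and the counts are hypothetical, and this is not anvi'o's own code.

```python
def filter_taxa_by_abundance(counts, min_abundance=0.1):
    """Keep taxa whose share of the total is at least `min_abundance` percent."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        taxon: count
        for taxon, count in counts.items()
        if 100.0 * count / total >= min_abundance
    }


# Hypothetical read counts per taxon in one input file (10,000 reads in total).
counts = {"taxon_A": 9900, "taxon_B": 95, "taxon_C": 5}
print(filter_taxa_by_abundance(counts))  # {'taxon_A': 9900, 'taxon_B': 95}
```

Here taxon_C accounts for 0.05% of the reads, which is below the 0.1% default, so it is eliminated.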

anvi-init-bam

Sort/Index BAM files

Example uses and other resources

Usage

usage: anvi-init-bam [-h] [-o FILE_PATH] [-T NUM_THREADS] BAM_FILE

Parameters

positional arguments:

  BAM_FILE              BAM file to analyze

optional arguments:

  -h, --help            show this help message and exit
  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: 1)

anvi-inspect

Start an anvi'o inspect interactive interface

Example uses and other resources

Usage

usage: anvi-inspect [-h] [-p PROFILE_DB] [-c CONTIGS_DB]
                    [--split-name SPLIT_NAME] [--hide-outlier-SNVs]
                    [-I IP_ADDR] [-P INT] [--server-only] [--just-do-it]

Parameters

DEFAULT INPUTS: The interactive interface can be started with anvi'o databases.

  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  --split-name SPLIT_NAME
                        Split name. (default: None)

VISUALS RELATED: Parameters that give access to various adjustments regarding the interface.

  --hide-outlier-SNVs   During profiling, anvi'o marks positions of single-
                        nucleotide variations (SNVs) that originate from
                        places in contigs where coverage values are a bit
                        'sketchy'. If you would like to avoid SNVs in those
                        positions of splits in applicable projects you can use
                        this flag, and the interface would hide SNVs that are
                        marked as 'outlier' (although it is clearly the best
                        to see everything, no one will judge you if you end up
                        using this flag) (plus, there may or may not be some
                        historical data on this here:
                        https://github.com/meren/anvio/issues/309). (default:
                        False)

SERVER CONFIGURATION: For power users.

  -I IP_ADDR, --ip-address IP_ADDR
                        IP address for the HTTP server. The default ip address
                        (0.0.0.0) should work just fine for most.
  -P INT, --port-number INT
                        Port number to use for anvi'o services. If nothing is
                        declared, anvi'o will try to find a suitable port
                        number, starting from the default port number, 8080.
                        (default: None)
  --server-only         The default behavior is to start the local server, and
                        fire up a browser that connects to the server. If you
                        have other plans, and want to start the server without
                        calling the browser, this is the flag you need.
                        (default: False)

GENERAL CONVENIENCE: From anvi'o developers to you.

  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)

anvi-interactive

Start an anvi'o server for the interactive interface

Example uses and other resources

Usage

usage: anvi-interactive [-h] [-p PROFILE_DB] [-c CONTIGS_DB]
                        [-C COLLECTION_NAME] [--manual-mode] [-f FASTA file]
                        [-d VIEW_DATA] [-t NEWICK] [--items-order FLAT_FILE]
                        [-V ADDITIONAL_VIEW] [-A ADDITIONAL_LAYERS]
                        [-F FUNCTION ANNOTATION SOURCE] [--gene-mode]
                        [--inseq-stats] [-b BIN_NAME] [--view NAME]
                        [--title NAME]
                        [--taxonomic-level {t_domain,t_phylum,t_class,t_order,t_family,t_genus,t_species}]
                        [--show-all-layers] [--split-hmm-layers]
                        [--hide-outlier-SNVs] [--state-autoload NAME]
                        [--collection-autoload NAME] [--export-svg FILE_PATH]
                        [--show-views] [--skip-check-names] [-o DIR_PATH]
                        [--dry-run] [--show-states] [--list-collections]
                        [--skip-init-functions] [--skip-auto-ordering]
                        [--skip-news] [--distance DISTANCE_METRIC]
                        [--linkage LINKAGE_METHOD] [-I IP_ADDR] [-P INT]
                        [--browser-path PATH] [--read-only] [--server-only]
                        [--password-protected] [--user-server-shutdown]

Parameters

DEFAULT INPUTS: The interactive interface can be started with and without anvi'o databases. The default use assumes you have your profile and contigs database, however, it is also possible to start the interface using ad hoc input files. See 'MANUAL INPUT' section for required parameters.

  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        If you have a collection in your profile database, you
                        can use this flag to start the interactive interface
                        with a tree showing your bins in your collection,
                        instead of each split. This is very useful when you
                        have imported your external binning results into
                        anvi'o, and want to see the distribution of your bins
                        across samples. In these cases anvi'o will cluster
                        your bins based on multiple metrics. Because this
                        particular clustering will be done on the fly within
                        anvi'o interactive class, you get to define a
                        distance metric and a linkage method using --distance
                        and --linkage parameters if you want! (default: None)

MANUAL INPUTS: Mandatory input parameters to start the interactive interface without anvi'o databases.

  --manual-mode         Using this flag, you can run the interactive interface
                        in an ad hoc manner using input files you curated
                        instead of standard output files generated by an
                        anvi'o run. In the manual mode you will be asked to
                        provide a profile database. In this mode a profile
                        database is only used to store 'state' of the
                        interactive interface so you can reload your visual
                        settings when you re-analyze the same files again. If
                        the profile database you provide does not exist,
                        anvi'o will create an empty one for you. (default:
                        False)
  -f FASTA file, --fasta-file FASTA file
                        A FASTA-formatted input file. (default: None)
  -d VIEW_DATA, --view-data VIEW_DATA
                        A TAB-delimited file for view data (default: None)
  -t NEWICK, --tree NEWICK
                        NEWICK formatted tree structure (default: None)
  --items-order FLAT_FILE
                        A flat file that contains the order of items you wish
                        to display using the interactive interface. You may
                        want to use this if you have a specific order of items
                        in your mind, and do not want to display a tree in the
                        middle (or simply you don't have one). The file format
                        is simple: each line should have an item name, and
                        there should be no header. (default: None)

ADDITIONAL STUFF: Parameters to provide additional layers, views, or layer data.

  -V ADDITIONAL_VIEW, --additional-view ADDITIONAL_VIEW
                        A TAB-delimited file for an additional view to be used
                        in the interface. This file should contain all split
                        names, and values for each of them in all samples.
                        Each column in this file must correspond to a sample
                        name. Content of this file will be called 'user_view',
                        which will be available as a new item in the 'views'
                        combo box in the interface (default: None)
  -A ADDITIONAL_LAYERS, --additional-layers ADDITIONAL_LAYERS
                        A TAB-delimited file for additional layers for splits.
                        The first column of this file must be split names, and
                        the remaining columns should be unique attributes. The
                        file does not need to contain all split names, or
                        values for each split in every column. Anvi'o will try
                        to deal with missing data nicely. Each column in this
                        file will be visualized as a new layer in the tree.
                        (default: None)
  -F FUNCTION ANNOTATION SOURCE, --annotation-source-for-per-split-summary FUNCTION ANNOTATION SOURCE
                        Using this parameter with a functional annotation
                        source that (1) is in the contigs database and (2) has
                        a maximum of 10 different function names, will
                        dynamically add a new layer to the interactive
                        interface where proportions of functions in that
                        source will be shown per split as stacked bar charts.
                        (default: None)
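
As a rough illustration of the -A file layout described above, the following Python sketch writes a TAB-delimited additional layers file. The split names, the attribute columns, and the `item_name` label for the first column are made up for this example; only the general shape (first column holds split names, remaining columns hold attributes, missing values are tolerated) comes from the help text. Leaving a cell empty for a missing value is an assumption.

```python
import csv
import io

def write_additional_layers(rows, attributes, handle, first_column="item_name"):
    """Write one line per split; attributes a split lacks are left empty."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Header: the first column names the splits, the rest are attribute names.
    writer.writerow([first_column] + attributes)
    for split_name, values in rows.items():
        writer.writerow([split_name] + [values.get(a, "") for a in attributes])

# Hypothetical splits and attributes, for illustration only.
layers = {
    "contig_1_split_00001": {"gc_bin": "high", "my_score": 0.82},
    "contig_1_split_00002": {"gc_bin": "low"},  # no my_score for this split
}

buffer = io.StringIO()
write_additional_layers(layers, ["gc_bin", "my_score"], buffer)
layers_text = buffer.getvalue()
print(layers_text)
```

In practice you would pass an open file handle instead of the in-memory buffer and hand the resulting path to -A.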

GENE MODE: Gene mode related parameters.

  --gene-mode           Initiate the interactive interface in 'gene mode'. In
                        this mode, the items are genes (instead of splits of
                        contigs). The following views are available: detection
                        (the detection value of each gene in each sample). The
                        mean_coverage (the mean coverage of genes). The
                        non_outlier_mean_coverage (the mean coverage of the
                        non-outlier nucleotide positions of each gene in each
                        sample (median absolute deviation is used to remove
                        outliers per gene per sample)). The
                        non_outlier_coverage_std view (standard deviation of
                        the coverage of non-outlier positions of genes in
                        samples). You can also choose to order items and
                        layers according to each one of the aforementioned
                        views. In addition, all layer orderings that are
                        available in the regular mode (i.e. the full mode
                        where you have contigs/splits) are also available in
                        'gene mode', so that, for example, you can choose to
                        order the layers according to 'detection', and that
                        would be the order according to the detection values
                        of splits, whereas if you choose 'genes_detections'
                        then the order of layers would be according to the
                        detection values of genes. Inspection and sequence
                        functionality are available (through the right-click
                        menu), except now sequences are of the specific gene.
                        Inspection now has two options available: 'Inspect
                        Context', which brings you to the inspection page of
                        the split to which the gene belongs where the
                        inspected gene will be highlighted in yellow in the
                        bottom, and 'Inspect Gene', which opens the inspection
                        page only for the gene and 100 nts around each side of
                        it (the purpose of this option is to make the
                        inspection page load faster if you only want to look
                        at the nucleotide coverage of a specific gene).
                        NOTICE: You can't store states or collections in 'gene
                        mode'. However, you still can make fake selections,
                        and create fake bins for your viewing convenience only
                        (smiley). Search options are available, and you can
                        even search for functions if you have them in your
                        contigs database. ANOTHER NOTICE: loading this mode
                        might take a while if your bin has many genes and
                        your profile database has many samples. This is
                        because the gene coverage stats are computed in an
                        ad hoc manner when you load this mode. We know this is
                        not ideal and we plan to improve that (along with
                        other things). If you have suggestions/complaints
                        regarding this mode please comment on this GitHub
                        issue: https://goo.gl/yHhRei. Please refer to the
                        online tutorial for more information. (default: False)
  --inseq-stats         Provide if working with INSeq/Tn-Seq genomic data.
                        With this, all gene level coverage stats will be
                        calculated using INSeq/Tn-Seq statistical methods.
                        (default: False)
  -b BIN_NAME, --bin-id BIN_NAME
                        Bin name you are interested in. (default: None)

VISUALS RELATED: Parameters that give access to various adjustments regarding the interface.

  --view NAME           Start the interface with a pre-selected view. To see a
                        list of available views, use --show-views flag.
                        (default: None)
  --title NAME          Title for the interface. If you are working with a
                        RUNINFO dict, the title will be determined based on
                        information stored in that file. Regardless, you can
                        override that value using this parameter. (default:
                        None)
  --taxonomic-level {t_domain,t_phylum,t_class,t_order,t_family,t_genus,t_species}
                        The taxonomic level to use whenever relevant and/or
                        available. The default taxonomic level is t_genus, but
                        if you choose something specific, anvi'o will focus on
                        that whenever possible.
  --show-all-layers     When declared, this flag tells the interface to show
                        every additional layer even if there are no hits. By
                        default, anvi'o doesn't show layers if there are no
                        hits for any of your items. (default: False)
  --split-hmm-layers    When declared, this flag tells the interface to split
                        every gene found in HMM searches that were performed
                        against non-singlecopy gene HMM profiles into their
                        own layer. Please see the documentation for details.
                        (default: False)
  --hide-outlier-SNVs   During profiling, anvi'o marks positions of single-
                        nucleotide variations (SNVs) that originate from
                        places in contigs where coverage values are a bit
                        'sketchy'. If you would like to avoid SNVs in those
                        positions of splits in applicable projects you can use
                        this flag, and the interface would hide SNVs that are
                        marked as 'outlier' (although it is clearly the best
                        to see everything, no one will judge you if you end up
                        using this flag) (plus, there may or may not be some
                        historical data on this here:
                        https://github.com/meren/anvio/issues/309). (default:
                        False)
  --state-autoload NAME
                        Automatically load previously saved state and draw tree.
                        To see a list of available states, use --show-states
                        flag. (default: None)
  --collection-autoload NAME
                        Automatically load a collection and draw tree. To see
                        a list of available collections, use --list-
                        collections flag. (default: None)
  --export-svg FILE_PATH
                        The SVG output file path. (default: None)

SWEET PARAMS OF CONVENIENCE: Parameters and flags that are not quite essential (but nice to have).

  --show-views          When declared, the program will show a list of
                        available views, and exit. (default: False)
  --skip-check-names    For debugging purposes. You should never really need
                        it. (default: False)
  -o DIR_PATH, --output-dir DIR_PATH
                        Directory path for output files (default: None)
  --dry-run             Don't do anything real. Test everything, and stop
                        right before wherever the developer said 'well, this
                        is enough testing', and decided to print out results.
                        (default: False)
  --show-states         When declared the program will print all available
                        states and exit. (default: False)
  --list-collections    Show available collections and exit. (default: False)
  --skip-init-functions
                        When declared, function calls for genes will not be
                        initialized (therefore will be missing from all
                        relevant interfaces or output files). The use of this
                        flag may reduce the memory footprint and processing
                        time for large datasets. (default: False)
  --skip-auto-ordering  When declared, the attempt to include automatically
                        generated orders of items based on additional data is
                        skipped. In case those buggers cause issues with your
                        data, and you still want to see your stuff and deal
                        with the other issue maybe later. (default: False)
  --skip-news           Don't try to read news content from upstream.
                        (default: False)
  --distance DISTANCE_METRIC
                        The distance metric for the hierarchical clustering.
                        Only relevant if you are running the interactive
                        interface in "collection" mode. The default is
                        "euclidean".
  --linkage LINKAGE_METHOD
                        The linkage method for the hierarchical clustering.
                        Only relevant if you are running the interactive
                        interface in "collection" mode. The default is "ward".

SERVER CONFIGURATION: For power users.

  -I IP_ADDR, --ip-address IP_ADDR
                        IP address for the HTTP server. The default ip address
                        (0.0.0.0) should work just fine for most.
  -P INT, --port-number INT
                        Port number to use for anvi'o services. If nothing is
                        declared, anvi'o will try to find a suitable port
                        number, starting from the default port number, 8080.
                        (default: None)
  --browser-path PATH   By default, anvi'o will use your default browser to
                        launch the interactive interface. If you would like to
                        use something other than your system default, you can
                        provide a full path for an alternative browser using
                        this parameter, and hope for the best. For instance we
                        are using this parameter to call Google's experimental
                        browser, Canary, which performs better with demanding
                        visualizations. (default: None)
  --read-only           When the interactive interface is started with this
                        flag, all 'database write' operations will be
                        disabled. (default: False)
  --server-only         The default behavior is to start the local server, and
                        fire up a browser that connects to the server. If you
                        have other plans, and want to start the server without
                        calling the browser, this is the flag you need.
                        (default: False)
  --password-protected  If this flag is set, the command line tool will ask
                        you to enter a password, and the interactive interface
                        will only be accessible after entering the same
                        password. This option is recommended for shared
                        machines like clusters or shared networks where
                        computers are not isolated. (default: False)
  --user-server-shutdown
                        Allow users to shut down an anvi'server via the web
                        interface. (default: False)
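
The -P description says anvi'o looks for a suitable port starting from 8080. The help text does not say how that search is done, so the following Python sketch only shows one plausible way to do it: walk upward from the starting port and take the first one that can be bound. The function names, the loopback host, and the `attempts` limit are assumptions for illustration, not anvi'o internals.

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if we can bind to the port, i.e. nothing else is using it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True

def find_port(start=8080, attempts=100, is_free=port_is_free):
    """Walk upward from `start` and return the first port that looks free."""
    for port in range(start, start + attempts):
        if is_free(port):
            return port
    raise RuntimeError(f"no free port found in {start}..{start + attempts - 1}")
```

Calling `find_port()` probes real ports on the local machine. The `is_free` argument exists so the search logic can be exercised with a stand-in check, for example `find_port(8080, is_free=lambda p: p >= 8083)`.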

anvi-matrix-to-newick

Takes a distance matrix, returns a newick tree

Example uses and other resources

Usage

usage: anvi-matrix-to-newick [-h] [-o FILE_PATH]
                             [--items-order-file FILE PATH] [--transpose]
                             [--distance DISTANCE_METRIC]
                             [--linkage LINKAGE_METHOD]
                             PATH

Parameters

INPUT: The data you wish to cluster

  PATH                  Input matrix

OUTPUT: How would you like your results to be reported?

  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
  --items-order-file FILE PATH
                        In addition to a newick formatted output file, you can
                        ask anvi'o to report the order of items in the
                        resulting tree in a separate file. The content of this
                        file will be a single column of item names, ordered
                        the way they appear in the output newick dendrogram.
                        (default: None)

SWEETS: Additional options

  --transpose           Transpose the input matrix file before clustering.
                        (default: False)
  --distance DISTANCE_METRIC
                        The distance metric for the hierarchical clustering.
                        The default distance metric is 'euclidean'. You can
                        find the full list of distance metrics either by
                        making a mistake (such as entering a non-existent
                        distance metric and making anvi'o upset), or by taking
                        a look at the documentation of the pdist function in
                        the scipy.spatial.distance module.
  --linkage LINKAGE_METHOD
                        The linkage method for the hierarchical clustering.
                        The default linkage method is 'ward', because that is
                        the best one. It really is. We talked to a lot of
                        people and they were all like 'this is the best one
                        available' and it is just all out there. Honestly it
                        is so good that we will build a wall around it and
                        make other linkage methods pay for it. But if you want
                        to see a full list of available ones you can check the
                        hierarchy.linkage function in the scipy.cluster module.
                        Up to you really. But then you can't use ward anymore,
                        and you would have to leave anvi'o right now.
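
To make the matrix-to-tree idea concrete, here is a small, self-contained Python sketch of agglomerative clustering that emits a newick string and the matching items order. It is not the anvi'o implementation: the help text points to scipy for the real work, and this sketch swaps in a naive average-linkage merge (rather than 'ward') with Euclidean distance so that it needs only the standard library. Branch lengths are omitted, and the item names and values are invented.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_distance(vectors_a, vectors_b):
    """Average linkage: mean distance over all cross-cluster pairs."""
    total = sum(euclidean(a, b) for a in vectors_a for b in vectors_b)
    return total / (len(vectors_a) * len(vectors_b))

def matrix_to_newick(items):
    """Cluster {name: vector} and return (newick_topology, items_order)."""
    if not items:
        raise ValueError("need at least one item to cluster")
    # Each cluster is (newick fragment, member vectors, member names in order).
    clusters = [(name, [vec], [name]) for name, vec in items.items()]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_distance(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        left, right = clusters[i], clusters[j]
        merged = (f"({left[0]},{right[0]})", left[1] + right[1], left[2] + right[2])
        # Drop the two merged clusters and append their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    newick, _, order = clusters[0]
    return newick + ";", order

# Hypothetical four-item matrix, for illustration only.
matrix = {"A": [0, 0], "B": [0, 1], "C": [10, 10], "D": [10, 12]}
newick, order = matrix_to_newick(matrix)
print(newick)  # ((A,B),(C,D));
print(order)   # ['A', 'B', 'C', 'D']
```

The returned order is what a file requested via --items-order-file would conceptually hold: one item name per line, in the order the leaves appear in the tree.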

anvi-mcg-classifier

A program to classify genes according to coverage across multiple metagenomes

Usage

usage: anvi-mcg-classifier [-h] -p PROFILE_DB -c CONTIGS_DB
                           [-O FILENAME_PREFIX] [-C COLLECTION_NAME]
                           [-b BIN_NAME] [-B FILE_PATH]
                           [--exclude-samples FILE] [--include-samples FILE]
                           [--gen-figures] [--get-samples-stats-only] [-W]
                           [--alpha NUM] [--outliers-threshold NUM]
                           [--zeros-are-outliers]

Parameters

ESSENTIAL INPUTS: You must supply a merged profile db (along with a matching contigs db)

  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)

ESSENTIAL OUTPUTS: The outputs of the algorithm are an anvi'o additional layers format file with the classification information for genes, and an anvi'o samples information file with detection information per sample. In addition, when a profile database is given, gene-coverages and gene-detection tables are also saved. All files are created with the prefix provided by the user.

  -O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
                        A prefix to be used while naming the output files (no
                        file type extensions please; just a prefix). (default:
                        None)
  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name. (default: None)

ADDITIONAL STUFF: Parameters to provide pre-existing additional layers, samples-information files, so that the outputs would be added to these files

  -b BIN_NAME, --bin-id BIN_NAME
                        Bin name you are interested in. (default: None)
  -B FILE_PATH, --bin-ids-file FILE_PATH
                        Text file for bins (each line should be a unique bin
                        id). (default: None)
  --exclude-samples FILE
                        List of samples to exclude for the analysis. (default:
                        None)
  --include-samples FILE
                        List of samples to include for the analysis. (default:
                        None)
  --gen-figures         For those of you who wish to dig deeper, a collection
                        of figures could be created to allow you to get
                        insight into how the classification was generated.
                        This is especially useful to identify cases in which
                        you shouldn't trust the classification (for example
                        due to a large number of outliers). NOTICE: if you ask
                        anvi'o to generate these figures then it will
                        significantly extend the execution time. To learn
                        about which figures are created and what they mean,
                        contact your nearest anvi'o developer, because
                        currently it is a well-hidden secret. (default: False)
  --get-samples-stats-only
                        If you only wish to get statistics regarding the
                        occurrence of bins in samples, then use this flag.
                        Especially when dealing with many samples or large
                        genomes, gene stats could take a long time to compute.
                        By using this flag you could save a lot of computation
                        time. (default: False)
  -W, --overwrite-output-destinations
                        Overwrite if the output files and/or directories
                        exist. (default: False)

PARAMETERS: Parameters to determine cut-offs for the gene-classifier

  --alpha NUM, --genome-detection-uncertainty NUM
                        Determines the range of sample detection values that
                        are considered negative, ambiguous or positive. Must
                        be at least 0 and smaller than 0.5; the default is
                        0.25. For example, with the default, samples with
                        detection below 0.5 - 0.25 = 0.25 will be considered
                        negative (i.e. do not contain the genome), samples
                        with detection between 0.25 and 0.75 will be
                        considered ambiguous (and hence will not be used for
                        the classification), and samples with detection above
                        0.5 + 0.25 = 0.75 will be considered positive (i.e.
                        contain the genome). (default: 0.25)
  --outliers-threshold NUM
                        Threshold to use for the outlier detection. The
                        default value is '1.5'. Absolute deviation around the
                        median is used. To read more about the method please
                        refer to: 'How to Detect and Handle Outliers' by Boris
                        Iglewicz and David Hoaglin
                        (doi:10.1016/j.jesp.2013.03.013).
  --zeros-are-outliers  If you want all zero coverage positions to be treated
                        like outliers then use this flag. The reason to treat
                        zero coverage as outliers is because when mapping
                        reads to a reference we could get many zero positions
                        due to accessory genes. These positions then skew the
                        average values that we compute. (default: False)
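
The cut-offs above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the classifier's actual code: `classify_sample` follows the --alpha arithmetic from the help text, while `flag_outliers` uses a plain "deviation from the median divided by the median absolute deviation" ratio. The exact scaling anvi'o applies to that ratio is not stated in the help text, so treat the outlier rule as hypothetical.

```python
from statistics import median

def classify_sample(detection, alpha=0.25):
    """Label a sample by its detection value using the --alpha window."""
    if not 0 <= alpha < 0.5:
        raise ValueError("alpha must be at least 0 and smaller than 0.5")
    if detection < 0.5 - alpha:
        return "negative"   # sample does not contain the genome
    if detection > 0.5 + alpha:
        return "positive"   # sample contains the genome
    return "ambiguous"      # not used for the classification

def flag_outliers(coverages, threshold=1.5, zeros_are_outliers=False):
    """Flag values whose deviation from the median exceeds threshold * MAD."""
    center = median(coverages)
    deviations = [abs(c - center) for c in coverages]
    mad = median(deviations)  # median absolute deviation
    flags = []
    for cov, dev in zip(coverages, deviations):
        # Assumption: with MAD == 0 nothing is flagged by the ratio test.
        is_outlier = mad > 0 and dev / mad > threshold
        if zeros_are_outliers and cov == 0:
            is_outlier = True
        flags.append(is_outlier)
    return flags
```

With the default alpha, `classify_sample(0.1)` gives "negative", `classify_sample(0.5)` gives "ambiguous", and `classify_sample(0.9)` gives "positive".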

anvi-merge

Merge multiple anvi'o profiles

Example uses and other resources

Usage

usage: anvi-merge [-h] -c CONTIGS_DB [-o DIR_PATH] [-S NAME]
                  [--description TEXT_FILE] [--skip-hierarchical-clustering]
                  [--enforce-hierarchical-clustering]
                  [--distance DISTANCE_METRIC] [--linkage LINKAGE_METHOD] [-W]
                  SINGLE_PROFILE(S) [SINGLE_PROFILE(S) ...]

Parameters

positional arguments:

  SINGLE_PROFILE(S)     Anvi'o single profiles to merge

optional arguments:

  -h, --help            show this help message and exit
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -o DIR_PATH, --output-dir DIR_PATH
                        Directory path for output files (default: None)
  -S NAME, --sample-name NAME
                        It is important to set a sample name (using only ASCII
                        letters and digits and without spaces) that is unique
                        (considering all others). If you do not provide one,
                        anvi'o will try to make up one for you based on other
                        information (although, you should never let the
                        software decide these things). (default: None)
  --description TEXT_FILE
                        A plain text file that contains some description about
                        the project. You can use Markdown syntax. The
                        description text will be rendered and shown in all
                        relevant interfaces, including the anvi'o interactive
                        interface, or anvi'o summary outputs. (default: None)
  --skip-hierarchical-clustering
                        If you are not planning to use the interactive
                        interface (or if you have other means to add a tree of
                        contigs in the database) you may skip the step where
                        hierarchical clustering of your items is performed
                        based on default clustering recipes matching to your
                        database type. (default: False)
  --enforce-hierarchical-clustering
                        If you have more than 25,000 splits in your merged
                        profile, anvi-merge will automatically skip the
                        hierarchical clustering of splits (by setting --skip-
                        hierarchical-clustering flag on). This is due to the
                        fact that computational time required for hierarchical
                        clustering increases steeply with the number of
                        items being clustered. Based on our experience we
                        decided that 25,000 splits is about the maximum we
                        should try. However, this is not a theoretical limit,
                        and you can override this heuristic by using this
                        flag, which would tell anvi'o to attempt to cluster
                        splits regardless. (default: False)
  --distance DISTANCE_METRIC
                        The distance metric for the hierarchical clustering.
                        If you do not use this flag, the default distance
                        metric will be used for each clustering configuration
                        which is "euclidean". (default: None)
  --linkage LINKAGE_METHOD
                        The linkage method for the hierarchical clustering.
                        Same story as with --distance, except the system
                        default for this one is "ward". (default: None)
  -W, --overwrite-output-destinations
                        Overwrite if the output files and/or directories
                        exist. (default: False)
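
The interplay between the 25,000-split heuristic and the two clustering flags can be summarized with a short Python sketch. The function names are invented for illustration, and letting the skip flag win when both flags are given is an assumption; the help text does not say what happens in that case. The pair count shows why large inputs get expensive: clustering starts from one distance per pair of items.

```python
MAX_SPLITS_FOR_CLUSTERING = 25_000  # threshold quoted in the help text

def pairwise_distance_count(num_items):
    """Number of distinct item pairs a distance matrix has to cover."""
    return num_items * (num_items - 1) // 2

def should_cluster(num_splits, skip=False, enforce=False):
    """Decide whether hierarchical clustering of splits would be attempted."""
    if skip:
        # Assumption: an explicit skip request wins over everything else.
        return False
    if num_splits > MAX_SPLITS_FOR_CLUSTERING:
        # Above the heuristic limit, cluster only when explicitly enforced.
        return enforce
    return True

print(pairwise_distance_count(25_000))  # 312487500
```

For example, `should_cluster(30_000)` is False, while `should_cluster(30_000, enforce=True)` is True.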

anvi-merge-bins

Merge a given set of bins in an anvi'o collection

Usage

usage: anvi-merge-bins [-h] -p PAN_OR_PROFILE_DB [-C COLLECTION_NAME]
                       [-b BIN NAMES] [-B BIN NAME] [--list-collections]
                       [--list-bins]

Parameters

DB AND COLLECTION: Simple enough. This guy needs a pan or profile database and a collection name. You can get a list of available collections with another flag down below.

  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
                        Anvi'o pan or profile database (and even genes
                        database in appropriate contexts). (default: None)
  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name. (default: None)

BINS TO WORK WITH: Here you need to define a list of bin names to merge, and the new bin name for them to merge under. Your bin names should be comma-separated. Both 'name_1, name_2, name_3' and name_1,name_2,name_3 will work. Your new bin name better be a single word, meaningful name so anvi'o does not complain about it later.

  -b BIN NAMES, --bin-names-list BIN NAMES
                        Comma-separated list of bin names. (default: None)
  -B BIN NAME, --new-bin-name BIN NAME
                        The new bin name. (default: None)
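
The note that both 'name_1, name_2, name_3' and name_1,name_2,name_3 are accepted suggests that whitespace around the commas is ignored. A minimal Python sketch of such parsing, with a hypothetical function name and no claim to match the anvi'o code, could look like this:

```python
def parse_bin_names(raw):
    """Split a comma-separated list and drop surrounding whitespace."""
    names = [name.strip() for name in raw.split(",")]
    # Ignore empty entries left behind by stray or trailing commas.
    return [name for name in names if name]
```

Both spellings from the help text yield the same list of three bin names.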

SWEET FLAGS OF CONVENIENCE: We gotchu.

  --list-collections    Show available collections and exit. (default: False)
  --list-bins           List available bins in a collection and exit.
                        (default: False)

anvi-merge-trnaseq

This program processes one or more anvi'o tRNA-seq databases produced by anvi-trnaseq and outputs anvi'o contigs and merged profile databases accessible to other tools in the anvi'o ecosystem. Final tRNA "seed sequences" are determined from a set of samples. Each sample yields a set of tRNA predictions stored in a tRNA-seq database, and these tRNAs may be shared among the samples. tRNAs may be 3' fragments and thereby subsequences of longer tRNAs from other samples, which would become seeds. The profile database produced by this program records the coverages of seeds in each sample. This program finalizes predicted nucleotide modification sites using tunable substitution rate parameters.

Usage

usage: anvi-merge-trnaseq [-h] [-o DIR_PATH] [-n PROJECT_NAME]
                          [-T NUM_THREADS] [--max-reported-trna-seeds INT]
                          [-W] [--description TEXT_FILE]
                          [--feature-threshold {acceptor_stem,fiveprime_acceptor_stem_sequence,position_8,position_9,d_arm,d_stem,fiveprime_d_stem_sequence,d_loop,threeprime_d_stem_sequence,position_26,anticodon_arm,anticodon_stem,fiveprime_anticodon_stem_sequence,anticodon_loop}]
                          [--preferred-treatment PREFERRED_TREATMENT]
                          [--nonspecific-output NONSPECIFIC_OUTPUT]
                          [--min-variation FLOAT]
                          [--min-third-fourth-nt FLOAT]
                          [--min-indel-fraction FLOAT]
                          [--distance DISTANCE_METRIC]
                          [--linkage LINKAGE_METHOD]
                          TRNASEQ_DBS) [TRNASEQ_DB(S ...]

Parameters

MANDATORY:

  TRNASEQ_DB(S)         Anvi'o tRNA-seq databases representing samples in an
                        experiment
  -o DIR_PATH, --output-dir DIR_PATH
                        Directory path for output files (default: None)
  -n PROJECT_NAME, --project-name PROJECT_NAME
                        Name of the project. Please choose a short but
                        descriptive name (so anvi'o can use it whenever she
                        needs to name an output file, or add a new table in a
                        database, or name her first born). (default: None)

EXTRAS:

  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on an SGE
                        cluster: if you are clusterizing your runs and asking
                        for multiple threads to use, you may deplete your
                        resources very fast. (default: 1)
  --max-reported-trna-seeds INT
                        This parameter limits the number of tRNA seed
                        sequences reported in the contigs database, as anvi-
                        interactive can have trouble displaying large numbers
                        of items. To remove the limit on reported seeds,
                        specify a value of -1. (default: 10000)
  -W, --overwrite-output-destinations
                        Overwrite if the output files and/or directories
                        exist. (default: False)
  --description TEXT_FILE
                        A plain text file that contains some description about
                        the project. You can use Markdown syntax. The
                        description text will be rendered and shown in all
                        relevant interfaces, including the anvi'o interactive
                        interface, or anvi'o summary outputs. (default: None)

ADVANCED:

  --feature-threshold {acceptor_stem,fiveprime_acceptor_stem_sequence,position_8,position_9,d_arm,d_stem,fiveprime_d_stem_sequence,d_loop,threeprime_d_stem_sequence,position_26,anticodon_arm,anticodon_stem,fiveprime_anticodon_stem_sequence,anticodon_loop}
                        This option prevents formation of tRNA seed sequences
                        from input sequences that did not reach the threshold
                        feature in anvi-trnaseq profiling from the 3' end. The
                        more stringent the threshold, the fewer spurious seeds
                        are formed from rare chimeric and other inaccurate
                        tRNA predictions. The most stringent threshold is
                        "acceptor_stem", the most 5' feature, resulting in
                        seeds formed only from tRNAs with a complete feature
                        set (with the exception of the extra 5'-G in tRNA-
                        His). (default: anticodon_loop)
  --preferred-treatment PREFERRED_TREATMENT
                        tRNA-seq databases recorded as employing the preferred
                        treatment are given preference in setting nucleotides
                        at predicted modification positions in tRNA seed
                        sequences. By default, equal preference is given to
                        all of the input databases. The reason for this
                        parameter is that paired untreated and enzymatically
                        treated splits can assist in the identification of
                        underlying modified nucleotides. For example, splits
                        treated with a demethylase can be compared to
                        untreated splits to probe which nucleotides are
                        methylated. (default: None)
  --nonspecific-output NONSPECIFIC_OUTPUT
                        A significant fraction of tRNA-seq reads can be from
                        tRNA fragments. These can be real biomolecules or
                        artifactual 3' fragments produced as a result of
                        incomplete reverse transcription of the tRNA template
                        to cDNA. Rather than randomly assigning fragments to a
                        single target, as in metagenomic read recruitment by
                        Bowtie, anvi-trnaseq tracks all of the longer
                        sequences containing each fragment. This results in
                        two categories of coverage: 'specific' for reads that
                        are only found in one seed and 'nonspecific' for reads
                        found in multiple seeds. Specific coverages are always
                        reported in a separate profile database. Nonspecific
                        coverages can be reported in three types of database,
                        as specified by this parameter. 'nonspecific_db'
                        produces a profile database only containing
                        nonspecific coverages. 'combined_db' produces a
                        database containing separate specific and nonspecific
                        layers. 'summed_db' produces a database containing
                        summed specific and nonspecific coverages. To produce
                        multiple types of databases, separate the database
                        types with commas (no spaces). For example, all three
                        databases are produced with the argument,
                        'nonspecific_db,combined_db,summed_db'. (default:
                        nonspecific_db,combined_db)
  --min-variation FLOAT
                        When more than 2 nucleotides are found at a position
                        in a tRNA, a modification-induced mutation
                        (substitution) is considered rather than a single
                        nucleotide variant. This parameter sets a key
                        criterion for the prediction of a modification, the
                        minimum fraction of specific coverage at a position
                        with more than 2 nucleotides that must be contributed
                        by nucleotides beside the most abundant nucleotide.
                        For example, if A, C, and G are found at position 20
                        of a tRNA, and A is represented by 95 reads, C by 3
                        reads, and G by 1 read, then with a parameter value of
                        0.05, the site would be 1 C, G, or T short of meeting
                        the threshold for prediction of a modification.
                        (default: 0.01)
  --min-third-fourth-nt FLOAT
                        This parameter sets a key criterion for the prediction
                        of a modification, the minimum fraction of specific
                        coverage at a position with more than 2 nucleotides
                        that must be contributed by nucleotides beside the 2
                        most abundant nucleotides. Unlike --min-variation,
                        this criterion only needs to be met for 1 sample to
                        permit modification of the position in all samples of
                        the experiment. For example, consider an experiment
                        with 2 samples and a parameter value of 0.01. In
                        Sample 1, A, C, and G are found at position 20 of a
                        tRNA, and A is represented by 95 reads, C by 4 reads,
                        and G by 1 read. The parameter value of 0.01 is
                        exactly met at the position thanks to G. In Sample
                        2, A, C, G, and T are found at position 20 of the same
                        tRNA seed, and A is represented by 1000 reads, C by
                        100 reads, G by 2 reads, and T by 2 reads. The third
                        and fourth nucleotides don't meet the coverage
                        threshold of 0.01, but this is irrelevant for calling
                        the modification, since Sample 1 met the criterion.
                        There is an important consideration due to the way
                        this threshold is currently imposed. Potential
                        modification sites that do not meet the threshold are
                        not treated like single nucleotide variants in anvi-
                        trnaseq: they do not cause the seed sequence to be
                        split such that no seed contains a variant that was
                        not deemed to be a modification. Rather, candidate
                        modification positions that do not meet this threshold
                        are retained in the seed BUT NOT REPORTED. Therefore,
                        we recommend rerunning this command with a parameter
                        value of 0 to inspect seeds for undisplayed variants
                        (possible SNVs) with a low level of third and fourth
                        nucleotides. (default: 0.002)
  --min-indel-fraction FLOAT
                        This parameter controls which indels are reported in
                        the tRNA-seq profile database. Coverage of an indel in
                        a sample must meet the minimum fraction of specific
                        coverage. Indel coverages are calculated separately
                        for specific, nonspecific, and summed coverages.
                        (default: 0.001)
  --distance DISTANCE_METRIC
                        The distance metric for the hierarchical clustering.
                        The default distance metric is 'euclidean'. You can
                        find the full list of distance metrics either by
                        making a mistake (such as entering a non-existent
                        distance metric and making anvi'o upset), or by taking
                        a look at the help menu of the pdist function in the
                        scipy.spatial.distance module.
  --linkage LINKAGE_METHOD
                        The linkage method for the hierarchical clustering.
                        The default linkage method is 'ward', because that is
                        the best one. It really is. We talked to a lot of
                        people and they were all like 'this is the best one
                        available' and it is just all out there. Honestly it
                        is so good that we will build a wall around it and
                        make other linkage methods pay for it. But if you want
                        to see a full list of available ones you can check the
                        hierarchy.linkage function in the scipy.cluster module.
                        Up to you really. But then you can't use ward anymore,
                        and you would have to leave anvi'o right now.
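
The two modification criteria described above for --min-variation and --min-third-fourth-nt come down to simple coverage fractions. The following is a minimal sketch of that arithmetic using the numbers from the help text, not the code anvi-merge-trnaseq actually runs. The function names are hypothetical, and treating the thresholds as inclusive is an assumption based on the phrase "exactly met". Fractions are used instead of floats so that a value sitting exactly on the threshold compares reliably.

```python
from fractions import Fraction


def variation_fraction(counts):
    """Share of coverage from every nucleotide except the most abundant one."""
    ranked = sorted(counts.values(), reverse=True)
    return Fraction(sum(ranked[1:]), sum(ranked))


def third_fourth_fraction(counts):
    """Share of coverage from every nucleotide except the two most abundant."""
    ranked = sorted(counts.values(), reverse=True)
    return Fraction(sum(ranked[2:]), sum(ranked))


def meets_third_fourth_criterion(samples, threshold):
    """True if at least one sample reaches the threshold (assumed inclusive)."""
    return any(third_fourth_fraction(counts) >= threshold for counts in samples)


# --min-variation example: A=95, C=3, G=1 with a parameter value of 0.05.
min_variation = Fraction("0.05")
site = {"A": 95, "C": 3, "G": 1}
short_of_threshold = variation_fraction(site) < min_variation   # 4/99
site["C"] += 1                                                  # one more non-A read
meets_threshold = variation_fraction(site) >= min_variation     # 5/100

# --min-third-fourth-nt example: two samples with a parameter value of 0.01.
sample_1 = {"A": 95, "C": 4, "G": 1}                  # G contributes 1/100
sample_2 = {"A": 1000, "C": 100, "G": 2, "T": 2}      # G and T contribute 4/1104
criterion_met = meets_third_fourth_criterion([sample_1, sample_2], Fraction("0.01"))

print(short_of_threshold, meets_threshold, criterion_met)
```

Sample 2 on its own falls short of 0.01, but the criterion is still met for the experiment because Sample 1 reaches it.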

anvi-meta-pan-genome

Convert a pangenome into a metapangenome

Usage

usage: anvi-meta-pan-genome [-h] -p PAN_DB [-g GENOMES_STORAGE] [-i FILE]
                            [--fraction-of-median-coverage FLOAT]
                            [--min-detection FLOAT]

Parameters

PANGENOME: Files for the pangenome.

  -p PAN_DB, --pan-db PAN_DB
                        Anvi'o pan database (default: None)
  -g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
                        Anvi'o genomes storage file (default: None)

METAGENOME: Genome bins stored in anvi'o profile databases as collections.

  -i FILE, --internal-genomes FILE
                        A four-column TAB-delimited flat text file. This file
                        should be identical to the internal genomes file you
                        used for your pangenomics analysis. Anvi'o will use
                        this file to find all profile and contigs databases
                        that contain the information for each gene and genome
                        across metagenomes. (default: None)

CRITERION FOR DETECTION: This is tricky. What we want to do is to identify genes that are occurring uniformly across samples.

  --fraction-of-median-coverage FLOAT
                        The value set here will be used to remove a gene if
                        its total coverage across environments is less than
                        the median coverage of all genes multiplied by this
                        value. The default is 0.25, which means, if the median
                        total coverage of all genes across all samples is
                        100X, then, a gene with a total coverage of less than
                        25X across all samples will be assumed not a part of
                        the 'environmental core'. (default: 0.25)
  --min-detection FLOAT
                        For this entire thing to work, the genome you are
                        focusing on should be detected in at least one
                        metagenome. If that is not the case, it would mean
                        that you do not have any sample that represents the
                        niche for this organism (or you do not have enough
                        depth of coverage) to investigate the detection of
                        genes in the environment. By default, this script
                        requires at least '0.5' of the genome to be detected
                        in at least one metagenome. This parameter allows you
                        to change that. 0 would mean no detection test
                        required, 1 would mean the entire genome must be
                        detected. (default: 0.5)
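
The two detection criteria above can be sketched in a few lines. This is only an illustration of the arithmetic described in the help text, not anvi'o's implementation. The gene identifiers, metagenome names, and function names are hypothetical.

```python
from statistics import median


def environmental_core(total_coverages, fraction_of_median_coverage=0.25):
    """Keep genes whose total coverage is not below fraction * median coverage."""
    cutoff = median(total_coverages.values()) * fraction_of_median_coverage
    return {gene for gene, cov in total_coverages.items() if cov >= cutoff}


def genome_is_detected(detection_per_metagenome, min_detection=0.5):
    """True if the genome reaches min_detection in at least one metagenome."""
    return any(d >= min_detection for d in detection_per_metagenome.values())


# Hypothetical total coverages of five genes across all samples.
coverages = {"gene_1": 100, "gene_2": 120, "gene_3": 80, "gene_4": 20, "gene_5": 100}
core = environmental_core(coverages)  # median is 100, so the cutoff is 25

# Hypothetical detection values of one genome in three metagenomes.
detected = genome_is_detected({"MG_01": 0.10, "MG_02": 0.62, "MG_03": 0.00})

print(sorted(core), detected)
```

With a median total coverage of 100X and the default value of 0.25, gene_4 (20X) falls below the 25X cutoff and is left out of the environmental core.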

anvi-migrate

Migrate an anvi'o database or config file to a newer version

Usage

usage: anvi-migrate [-h] [--migrate-dbs-safely] [--migrate-dbs-quickly]
                    [--just-do-it] [-t VERSION]
                    DATABASE(S) [DATABASE(S) ...]

Parameters

INPUTS: You will literally give us any anvi'o database.

  DATABASE(S)           Anvi'o database or config file for migration. You can
                        give many of them all at once. Running `anvi-migrate
                        *.db` in a directory will migrate all databases in
                        that directory.

SAFETY: It is up to you. Safe things take much longer and are boring. Unsafe things are fast, fun, and .. well, don't come to us if your computer loses power or something.

  --migrate-dbs-safely  If you choose this, anvi'o will first create a copy of
                        your original database. If something goes wrong, it
                        will restore the original. If everything works, it
                        will remove the old copy. IF YOU HAVE DATABASES THAT
                        ARE VERY LARGE OR IF YOU ARE MIGRATING MANY MANY OF
                        THEM THIS OPTION WILL ADD A HUGE I/O BURDEN ON YOUR
                        SYSTEM. But still. Safety is safe. (default: False)
  --migrate-dbs-quickly
                        If you choose this, anvi'o will migrate your databases
                        in place. It will be much faster (and arguably more
                        fun) than the safe option, but if something goes
                        wrong, you will lose data. During the first five years
                        of anvi'o development not a single user lost data
                        using our migration scripts as far as we know. But
                        there is always a first, and today might be your lucky
                        day. (default: False)

PARAMETERS OF CONVENIENCE: This is how anvi'o spoils you.

  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)
  -t VERSION, --target-version VERSION
                        Anvi'o will stop upgrading your database when it
                        reaches this version. (default: None)
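
The copy, restore-on-failure, remove-on-success behavior described for --migrate-dbs-safely follows a common backup pattern. The sketch below illustrates that pattern on a plain text file. It is not the code anvi-migrate runs, and `migrate_safely`, the `.backup` suffix, and the two toy migration functions are hypothetical.

```python
import os
import shutil
import tempfile


def migrate_safely(db_path, migrate):
    """Copy db_path, run migrate(db_path), and restore the copy if it fails."""
    backup_path = db_path + ".backup"
    shutil.copy2(db_path, backup_path)
    try:
        migrate(db_path)
    except Exception:
        os.replace(backup_path, db_path)  # put the original back
        raise
    else:
        os.remove(backup_path)            # success: the copy is no longer needed


def good_migration(path):
    with open(path, "w") as handle:
        handle.write("version 2")


def bad_migration(path):
    with open(path, "w") as handle:
        handle.write("half-written")
    raise RuntimeError("something went wrong mid-migration")


with tempfile.TemporaryDirectory() as tmp:
    db = os.path.join(tmp, "example.db")
    with open(db, "w") as handle:
        handle.write("version 1")

    migrate_safely(db, good_migration)
    with open(db) as handle:
        after_success = handle.read()

    try:
        migrate_safely(db, bad_migration)
    except RuntimeError:
        pass
    with open(db) as handle:
        after_failure = handle.read()
    backup_left = os.path.exists(db + ".backup")

print(after_success, "|", after_failure, "|", backup_left)
```

The extra copy is what creates the I/O burden the help text warns about: every database is written twice before the migration even starts.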

anvi-oligotype-linkmers

Takes an anvi'o linkmers report, generates an oligotyping output

Example uses and other resources

Usage

usage: anvi-oligotype-linkmers [-h] -i LINKMER_REPORT -o DIR_PATH

Parameters

optional arguments:

  -i LINKMER_REPORT, --input-file LINKMER_REPORT
                        Output file of `anvi-report-linkmers`. (default: None)
  -o DIR_PATH, --output-dir DIR_PATH
                        Directory path for output files (default: None)

anvi-pan-genome

An anvi'o program to compute a pangenome from an anvi'o genome storage

Example uses and other resources

Usage

usage: anvi-pan-genome [-h] -g GENOMES_STORAGE [-G GENOME_NAMES]
                       [--skip-alignments] [--skip-homogeneity]
                       [--quick-homogeneity] [--align-with ALIGNER]
                       [--exclude-partial-gene-calls] [--use-ncbi-blast]
                       [--minbit MINBIT] [--mcl-inflation INFLATION]
                       [--min-occurrence NUM_OCCURRENCE]
                       [--min-percent-identity PERCENT] [--sensitive]
                       [-n PROJECT_NAME] [--description TEXT_FILE]
                       [-o PAN_DB_DIR] [-W] [-T NUM_THREADS]
                       [--skip-hierarchical-clustering]
                       [--enforce-hierarchical-clustering]
                       [--distance DISTANCE_METRIC] [--linkage LINKAGE_METHOD]

Parameters

GENOMES: The very fancy genomes storage file. This file is generated by the program anvi-gen-genomes-storage. Please see the online tutorial on the pangenomic workflow if you don't know how to generate one.

  -g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
                        Anvi'o genomes storage file (default: None)
  -G GENOME_NAMES, --genome-names GENOME_NAMES
                        Genome names to 'focus'. You can use this parameter to
                        limit the genomes included in your analysis. You can
                        provide these names as a comma-separated list of
                        names, or you can put them in a file, where you have a
                        single genome name in each line, and provide the file
                        path. (default: None)

PARAMETERS: Important stuff Tom never pays attention to (but you should).

  --skip-alignments     By default, anvi'o attempts to align amino acid
                        sequences in each gene cluster using multiple sequence
                        alignment via muscle. You can use this flag to skip
                        that step and be upset later. (default: False)
  --skip-homogeneity    By default, anvi'o attempts to calculate homogeneity
                        values for every gene cluster, given that they are
                        aligned. You can use this flag to have anvi'o skip
                        homogeneity calculations. Anvi'o will ignore this flag
                        if you decide to skip alignments (default: False)
  --quick-homogeneity   By default, anvi'o will use a homogeneity algorithm
                        that checks for horizontal and vertical geometric
                        homogeneity (along with functional). With this flag,
                        you can tell anvi'o to skip horizontal geometric
                        homogeneity calculations. It will be less accurate but
                        quicker. Anvi'o will ignore this flag if you skip
                        homogeneity calculations or alignments altogether.
                        (default: False)
  --align-with ALIGNER  The multiple sequence alignment program to use when
                        multiple sequence alignment is necessary. To see all
                        available options, use the flag `--list-aligners`.
                        (default: None)
  --exclude-partial-gene-calls
                        By default, anvi'o includes all partial gene calls
                        in the analysis, which, in some cases, may inflate
                        the number of gene clusters identified and introduce
                        extra heterogeneity within those gene clusters. Using
                        this flag, you can request anvi'o to exclude partial
                        gene calls from the analysis (whether a gene call is
                        partial or not is an information that comes directly
                        from the gene caller used to identify genes during the
                        generation of the contigs database). (default: False)
  --use-ncbi-blast      This program uses DIAMOND by default, however, if you
                        like, you can use good ol' blastp from NCBI instead.
                        (default: False)
  --minbit MINBIT       The minimum minbit value. The minbit heuristic
                        provides a means to eliminate weak matches between
                        two amino acid sequences. We learned it from
                        ITEP (Benedict MN et al, doi:10.1186/1471-2164-15-8),
                        which is a comprehensive analysis workflow for
                        pangenomes, and decided to use it in the anvi'o
                        pangenomic workflow, as well. Briefly, if you have two
                        amino acid sequences, 'A' and 'B', the minbit is
                        defined as 'BITSCORE(A, B) / MIN(BITSCORE(A, A),
                        BITSCORE(B, B))'. So the minbit score between two
                        sequences goes to 1 if they are very similar over the
                        entire length of the 'shorter' amino acid sequence,
                        and goes to 0 if (1) they match over a very short
                        stretch compared even to the length of the shorter
                        amino acid sequence or (2) the sequence identity of
                        the match is low. The default is 0.5.
  --mcl-inflation INFLATION
                        MCL inflation parameter, that defines the sensitivity
                        of the algorithm during the identification of the gene
                        clusters. More information on this parameter and its
                        effect on cluster granularity is here:
                        (http://micans.org/mcl/man/mclfaq.html#faq7.2). The
                        default is 2.
  --min-occurrence NUM_OCCURRENCE
                        Do you not want singletons? You don't? Well, this
                        parameter will help you get rid of them (along with
                        doubletons, if you want). Anvi'o will remove gene
                        clusters that occur less than the number you set using
                        this parameter from the analysis. The default is 1,
                        which means everything will be kept. If you want to
                        remove singletons, set it to 2, if you want to remove
                        doubletons as well, set it to 3, and so on.
  --min-percent-identity PERCENT
                        Minimum percent identity between the two amino acid
                        sequences for them to have an edge for MCL analysis.
                        This value will be used to filter hits from Diamond
                        search results. Because percent identity is not a
                        predictor of a good match (since it does not
                        communicate many other important factors such as the
                        alignment length between the two sequences and its
                        proportion to the entire length of those involved), we
                        suggest you rely on the 'minbit' parameter. But you know
                        what? Maybe you shouldn't listen to anyone, and
                        experiment on your own! The default is 0 percent.
  --sensitive           DIAMOND sensitivity. With this flag you can instruct
                        DIAMOND to be 'sensitive', rather than 'fast' during
                        the search. It is likely the search will take
                        remarkably longer. But, hey, if you are doing it for
                        your final analysis, maybe it should take longer and
                        be more accurate. This flag is only relevant if you
                        are running DIAMOND. (default: False)
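
The minbit formula and the --min-occurrence filter described above can be written out directly. This is a minimal sketch under stated assumptions, not anvi'o's implementation: the function names and bit scores are hypothetical, the minbit threshold is assumed to be inclusive, and occurrence is assumed to mean the number of genomes a gene cluster is found in.

```python
def minbit(bitscore_ab, bitscore_aa, bitscore_bb):
    """BITSCORE(A, B) / MIN(BITSCORE(A, A), BITSCORE(B, B))."""
    return bitscore_ab / min(bitscore_aa, bitscore_bb)


def keep_hit(bitscore_ab, bitscore_aa, bitscore_bb, threshold=0.5):
    """Keep a hit only if its minbit reaches the threshold (assumed inclusive)."""
    return minbit(bitscore_ab, bitscore_aa, bitscore_bb) >= threshold


def filter_by_occurrence(gene_clusters, min_occurrence=1):
    """Drop gene clusters found in fewer genomes than min_occurrence."""
    return {
        name: genomes
        for name, genomes in gene_clusters.items()
        if len(genomes) >= min_occurrence
    }


# Hypothetical bit scores: A against itself 200, B against itself 100.
strong_hit = keep_hit(bitscore_ab=90, bitscore_aa=200, bitscore_bb=100)  # minbit 0.9
weak_hit = keep_hit(bitscore_ab=30, bitscore_aa=200, bitscore_bb=100)    # minbit 0.3

# Hypothetical gene clusters and the genomes they occur in.
clusters = {
    "GC_0001": {"genome_A", "genome_B", "genome_C"},
    "GC_0002": {"genome_A", "genome_B"},
    "GC_0003": {"genome_C"},  # a singleton
}
without_singletons = filter_by_occurrence(clusters, min_occurrence=2)

print(strong_hit, weak_hit, sorted(without_singletons))
```

Dividing by the smaller of the two self-scores is what ties the score to the shorter sequence: a short protein that matches a long one along its whole length still gets a minbit close to 1.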

OTHERS: Sweet parameters of convenience.

  -n PROJECT_NAME, --project-name PROJECT_NAME
                        Name of the project. Please choose a short but
                        descriptive name (so anvi'o can use it whenever she
                        needs to name an output file, or add a new table in a
                        database, or name her first born). (default: None)
  --description TEXT_FILE
                        A plain text file that contains some description about
                        the project. You can use Markdown syntax. The
                        description text will be rendered and shown in all
                        relevant interfaces, including the anvi'o interactive
                        interface, or anvi'o summary outputs. (default: None)
  -o PAN_DB_DIR, --output-dir PAN_DB_DIR
                        Directory path for output files (default: None)
  -W, --overwrite-output-destinations
                        Overwrite if the output files and/or directories
                        exist. (default: False)
  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: 1)

ORGANIZING GENE CLUSTERS: These are things that will change the clustering dendrogram of your gene clusters.

  --skip-hierarchical-clustering
                        Anvi'o attempts to generate a hierarchical clustering
                        of your gene clusters once it identifies them so you
                        can use `anvi-display-pan` to play with it. But if you
                        want to skip this step, this is your flag. (default:
                        False)
  --enforce-hierarchical-clustering
                        If you want anvi'o to try to generate a hierarchical
                        clustering of your gene clusters even if the number of
                        gene clusters exceeds its suggested limit for
                        hierarchical clustering, you can use this flag to
                        enforce it. Are you a rebel of some sort? Or did
                        computers make you upset? Express your anger towards
                        machines using this flag. (default: False)
  --distance DISTANCE_METRIC
                        The distance metric for the clustering of gene
                        clusters. If you do not use this flag, the default
                        distance metric will be used for each clustering
                        configuration which is "euclidean". (default: None)
  --linkage LINKAGE_METHOD
                        The same story with the `--distance`, except, the
                        system default for this one is ward. (default: None)
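
To make the role of the distance metric concrete, here is a small standard-library sketch of pairwise Euclidean distances between gene clusters. It is only an illustration: the gene cluster names and their per-genome vectors are hypothetical, and the sketch stops at the distance step. In practice a linkage method such as ward would then build the dendrogram from these distances, for example with the `pdist` and `linkage` functions in SciPy.

```python
from itertools import combinations
from math import dist

# Hypothetical per-genome profiles (one value per genome) for three gene clusters.
profiles = {
    "GC_0001": [1, 1, 1, 1],
    "GC_0002": [1, 1, 1, 0],
    "GC_0003": [0, 0, 1, 1],
}


def pairwise_euclidean(vectors):
    """Euclidean distance for every unordered pair of named vectors."""
    return {
        (a, b): dist(vectors[a], vectors[b])
        for a, b in combinations(sorted(vectors), 2)
    }


distances = pairwise_euclidean(profiles)
closest_pair = min(distances, key=distances.get)

for pair, value in distances.items():
    print(pair, round(value, 3))
print("closest:", closest_pair)
```

Gene clusters with similar profiles across genomes end up close to each other, which is why changing the metric changes the shape of the dendrogram.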

anvi-plot-trnaseq

A program to write plots of coverage and modification data from flexible groups of tRNA-seq seeds

Usage

usage: anvi-plot-trnaseq [-h] -c CONTIGS_DB --seeds-specific-txt TEXT_FILE
                         --modifications-txt TEXT_FILE [-o DIR_PATH]

Parameters

MANDATORY:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  --seeds-specific-txt TEXT_FILE, -s TEXT_FILE
                        A tab-delimited text file containing data on tRNA
                        seeds including specific coverages. `anvi-tabulate-
                        trnaseq` generates this file from anvi'o tRNA-seq
                        databases. (default: None)
  --modifications-txt TEXT_FILE, -m TEXT_FILE
                        A tab-delimited text file containing modification data
                        on tRNA seeds. `anvi-tabulate-trnaseq` generates this
                        file from anvi'o tRNA-seq databases. (default: None)

OPTIONAL:

  -o DIR_PATH, --output-dir DIR_PATH
                        Directory path for output files (default: None)

anvi-profile

The flagship anvi'o program to profile a BAM file. Running this program on a BAM file will quantify coverages per nucleotide position in read recruitment results and will average coverage and detection data per contig. It will also calculate single-nucleotide, single-codon, and single-amino acid variants, as well as structural variants, such as insertions and deletions, and eventually store all data in a single anvi'o profile database. For very large projects, this program can demand a lot of time, memory, and storage resources. If all you want is to learn coverages of your nucleotides, genes, contigs, or your bins collections from BAM files very rapidly, and/or you do not need anvi'o single profile databases for your project, please see other anvi'o programs that profile BAM files, anvi-script-get-coverage-from-bam and anvi-profile-blitz.

metagenomics profile_db contigs_db bam variability clustering

Example uses and other resources

Usage

usage: anvi-profile [-h] [-i INPUT_BAM] [-c CONTIGS_DB] [--blank-profile]
                    [--min-percent-identity PERCENT_IDENTITY]
                    [--fetch-filter FILTER] [-o DIR_PATH] [-W] [-S NAME]
                    [--report-variability-full] [--skip-SNV-profiling]
                    [--skip-INDEL-profiling] [--profile-SCVs]
                    [--description TEXT_FILE] [--cluster-contigs]
                    [--skip-hierarchical-clustering]
                    [--distance DISTANCE_METRIC] [--linkage LINKAGE_METHOD]
                    [-M INT] [--max-contig-length INT] [-X INT] [-V INT]
                    [--list-contigs] [--contigs-of-interest FILE]
                    [-T NUM_THREADS] [--queue-size INT]
                    [--write-buffer-size-per-thread INT] [--force-multi]

Parameters

INPUTS: There are two possible inputs for the anvi'o profiler. You must declare one of these two.

  -i INPUT_BAM, --input-file INPUT_BAM
                        Sorted and indexed BAM file to analyze. Takes a long
                        time depending on the length of the file and
                        parameters used for profiling. (default: None)
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  --blank-profile       If you only have contig sequences, but no mapping data
                        (i.e., you found a genome and would like to take a
                        look at it), this flag will come in handy. After
                        creating a contigs database for your contigs, you can
                        create a blank anvi'o profile database to use anvi'o
                        interactive interface with that contigs database
                        without any mapping data. (default: False)

FILTERS: Choose which reads to work (or not to work) with like a pro.

  --min-percent-identity PERCENT_IDENTITY
                        Ignore any reads with a percent identity to the
                        reference less than this number, e.g. 95. If not
                        provided, all reads in the BAM file will be used (and
                        things will run faster). (default: None)
  --fetch-filter FILTER
                        By default, anvi'o fetches all reads from a BAM file.
                        Once a read is 'fetched', some reads may be excluded
                        if you have used parameters such as `--min-percent-
                        identity`. But the `--fetch-filter` is different as it
                        determines WHICH reads from a BAM file will be used
                        for profiling at all. You can do a lot of fun things
                        with this parameter. For details, please read the
                        online documentation for `anvi-profile` using the URL
                        you should see at the end of the `--help` output on
                        your terminal. The known filters are the following:
                        double-forwards, double-reverses, inversions, single-
                        mapped-reads, distant-pairs-1K. (default: None)

EXTRAS: Things that are not mandatory, but can be useful if/when declared.

  -o DIR_PATH, --output-dir DIR_PATH
                        Directory path for output files (default: None)
  -W, --overwrite-output-destinations
                        Overwrite if the output files and/or directories
                        exist. (default: False)
  -S NAME, --sample-name NAME
                        It is important to set a sample name (using only ASCII
                        letters and digits and without spaces) that is unique
                        (considering all others). If you do not provide one,
                        anvi'o will try to make up one for you based on other
                        information (although, you should never let the
                        software decide these things). (default: None)
  --report-variability-full
                        One of the things anvi-profile does is to store
                        information about variable nucleotide positions
                        (SNVs). Usually it does not report every variable
                        position, since not every variable position is genuine
                        variation. Say, if you have 1,000 coverage, and all
                        nucleotides at that position are Ts and only one of
                        them is a C, the confidence of that C being a real
                        variation is quite low. anvi'o has a simple algorithm
                        in place to reduce the impact of noise. However, using
                        this flag you can disable it and ask profiler to
                        report every single variation (which may result in
                        very large output files and millions of reports, but
                        you are the boss). Do not forget to take a look at '--
                        min-coverage-for-variability' parameter. Also note
                        that this flag controls indel reporting: normally '--
                        min-coverage-for-variability' and internal anvi'o
                        heuristics control whether or not indels should be
                        reported, but with this flag all indels are reported.
                        (default: False)
  --skip-SNV-profiling  By default, anvi'o characterizes single-nucleotide
                        variation in each sample. The use of this flag will
                        instruct profiler to skip that step. Please remember
                        that parameters and flags must be identical between
                        different profiles using the same contigs database for
                        them to merge properly. (default: False)
  --skip-INDEL-profiling
                        The alignment of a read to a reference genome/sequence
                        can be imperfect, such that the read exhibits
                        insertions or deletions relative to the reference.
                        Anvi'o normally stores this information in the profile
                        database since the time taken and extra storage do not
                        amount to much, but if you insist on not having this
                        information, you can skip storing this information by
                        providing this flag. Note: If --skip-SNV-profiling is
                        provided, --skip-INDEL-profiling will automatically be
                        enforced. (default: False)
  --profile-SCVs        Anvi'o can perform accurate characterization of codon
                        frequencies in genes during profiling. While having
                        codon frequencies opens doors to powerful evolutionary
                        insights in downstream analyses, due to its
                        computational complexity, this feature comes 'off' by
                        default. Using this flag you can rise against the
                        authority, as you always should, and make anvi'o
                        profile codons. (default: False)
  --description TEXT_FILE
                        A plain text file that contains some description about
                        the project. You can use Markdown syntax. The
                        description text will be rendered and shown in all
                        relevant interfaces, including the anvi'o interactive
                        interface, or anvi'o summary outputs. (default: None)

HIERARCHICAL CLUSTERING: Do you want your splits to be clustered? Yes? No? Maybe? Remember: By default, anvi-profile will not perform hierarchical clustering on your splits; but if you use --blank flag, it will try. You can skip that by using the --skip-hierarchical-clustering flag.

  --cluster-contigs     Single profiles are rarely used for genome binning or
                        visualization, and since clustering step increases the
                        profiling runtime for no good reason, the default
                        behavior is to not cluster contigs for individual
                        runs. However, if you are planning to do binning on
                        one sample, you must use this flag to tell anvi'o to
                        run cluster configurations for single runs on your
                        sample. (default: False)
  --skip-hierarchical-clustering
                        If you are not planning to use the interactive
                        interface (or if you have other means to add a tree of
                        contigs in the database) you may skip the step where
                        hierarchical clustering of your items is performed
                        based on default clustering recipes matching to your
                        database type. (default: False)
  --distance DISTANCE_METRIC
                        The distance metric for the hierarchical clustering.
                        Only relevant if you are using `--cluster-contigs`
                        flag. The default is "euclidean".
  --linkage LINKAGE_METHOD
                        The linkage method for the hierarchical clustering.
                        Just like the distance metric this is only relevant if
                        you are using it with `--cluster-contigs` flag. The
                        default is "ward".

NUMBERS: Defaults of these parameters will impact your analysis. You can always come back to them and update your profiles, but it is important to make sure defaults are reasonable for your sample.

  -M INT, --min-contig-length INT
                        Minimum length of contigs in a BAM file to analyze.
                        The minimum length should be long enough for tetra-
                        nucleotide frequency analysis to be meaningful. There
                        is no way to define a golden number of minimum length
                        that would be applicable to genomes found in all
                        environments, but we chose the default to be 1000, and
                        have been happy with it. You are welcome to
                        experiment, but we advise to never go below 1,000. You
                        also should remember that the lower you go, the more
                        time it will take to analyze all contigs. You can use
                        --list-contigs parameter to have an idea how many
                        contigs would be discarded for a given M.
  --max-contig-length INT
                        Just like the minimum contig length parameter, but to
                        set a maximum. Basically this will remove any contig
                        longer than a certain value. Why would anyone need
                        this? Who knows. But if you ever do, it is here.
                        (default: 0)
  -X INT, --min-mean-coverage INT
                        Minimum mean coverage for contigs to be kept in the
                        analysis. The default value is 0, which is for your
                        best interest if you are going to profile multiple BAM
                        files which are then going to be merged for a cross-
                        sectional or time series analysis. Do not change it if
                        you are not sure this is what you want to do.
  -V INT, --min-coverage-for-variability INT
                        Minimum coverage of a nucleotide position to be
                        subjected to SNV profiling. By default, anvi'o will
                        not attempt to make sense of variation in a given
                        nucleotide position if it is covered less than 10X.
                        You can change that minimum using this parameter. This
                        parameter also controls the minimum coverage for
                        reporting indels. If an indel is observed at a
                        position, yet the coverage of the position in the
                        contig where the indel starts is less than this
                        parameter, the indel will be discarded.
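
To get a feel for what the two thresholds above do, here is a minimal Python sketch. It is not how anvi'o implements them: the function names are hypothetical, and treating a value exactly equal to the minimum as passing is an assumption. The defaults (1000 for `-M` and 10X for `-V`) come from the help text.

```python
def contigs_discarded(contig_lengths, min_contig_length=1000):
    """Count contigs shorter than the minimum length (these would not be analyzed)."""
    return sum(1 for length in contig_lengths if length < min_contig_length)


def positions_eligible_for_snv(coverages, min_coverage_for_variability=10):
    """Return indices of nucleotide positions covered at or above the minimum."""
    return [i for i, cov in enumerate(coverages)
            if cov >= min_coverage_for_variability]


if __name__ == "__main__":
    # Hypothetical contig lengths and per-position coverage values.
    print(contigs_discarded([500, 1000, 2500]))
    print(positions_eligible_for_snv([3, 10, 250, 9]))
```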

CONTIGS: Sweet parameters of convenience

  --list-contigs        When declared, the program will list contigs in the
                        BAM file and exit gracefully without any further
                        analysis. (default: False)
  --contigs-of-interest FILE
                        It is possible to focus on only a set of contigs. If
                        you would like to do that and ignore the rest of the
                        contigs in your contigs database, use this parameter
                        with a flat file every line of which describes a single
                        contig name. (default: None)
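
The flat file expected by `--contigs-of-interest` has one contig name per line. A small Python sketch for producing such text is below. The contig names are made up, and stripping whitespace and dropping duplicates are conveniences of this sketch, not documented requirements.

```python
def format_contigs_of_interest(contig_names):
    """Return flat-file text with one contig name per line."""
    seen = set()
    lines = []
    for name in contig_names:
        name = name.strip()
        # Skip empty entries and names that were already listed.
        if name and name not in seen:
            seen.add(name)
            lines.append(name)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical contig names; write the result to a file of your choice.
    print(format_contigs_of_interest(["contig_1", " contig_2 ", "contig_1"]), end="")
```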

PERFORMANCE: Performance settings for profiler

  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: 1)
  --queue-size INT      The queue size for worker threads to store data to
                        communicate to the main thread. The default is set by
                        the class based on the number of threads. If you have
                        *any* hesitation about whether you know what you are
                        doing, you should not change this value. (default: 0)
  --write-buffer-size-per-thread INT
                        How many items should be kept in memory before they
                        are written to the disk. The default is 500 per
                        thread. So a single-threaded job would have a write
                        buffer size of 500, whereas a job with 4 threads would
                        have a write buffer size of 4*500. The larger the
                        buffer size, the less frequently the program will
                        access the disk, yet the more memory will be consumed
                        since the processed items will be cleared off the
                        memory only after they are written to the disk. The
                        default buffer size will likely work for most cases.
                        Please keep an eye on the memory usage output to make
                        sure the memory use never exceeds the size of the
                        physical memory.
  --force-multi         This is not useful to non-developers. It forces the
                        multi-process routine even when 1 thread is chosen.
                        (default: False)
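
The write buffer arithmetic described for `--write-buffer-size-per-thread` is simple multiplication. The sketch below only restates it, with a hypothetical function name:

```python
def total_write_buffer_size(num_threads: int, per_thread: int = 500) -> int:
    """Total number of items kept in memory before being written to disk."""
    if num_threads < 1 or per_thread < 1:
        raise ValueError("num_threads and per_thread must be positive")
    # The help text gives 500 for one thread and 4*500 for four threads.
    return num_threads * per_thread


if __name__ == "__main__":
    for threads in (1, 4):
        print(threads, total_write_buffer_size(threads))
```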

anvi-profile-blitz

FAST profiling of BAM files to get contig- or gene-level coverage and detection stats. Unlike anvi-profile, which is another anvi'o program that can profile BAM files, this program is designed to be very quick and only report long-format files for various read recruitment statistics per item. Please also see the program anvi-script-get-coverage-from-bam for recovery of data from BAM files without an anvi'o contigs database.

Usage

usage: anvi-profile-blitz [-h] -c CONTIGS_DB [--gene-mode]
                          [--gene-caller GENE-CALLER] -o FILE_PATH
                          [--report-minimal]
                          BAM_FILES) [BAM_FILE(S ...]

Parameters

positional arguments:

  BAM_FILE(S)           One or more indexed BAM files

optional arguments:

  -h, --help            show this help message and exit

INPUT DB: You will need to give this program an anvi'o contigs database.

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)

GENES?: You can work with genes instead of contigs

  --gene-mode           This program by default will summarize coverage and
                        detection stats for contigs found in your contigs
                        database. Declaring this flag will change that
                        behavior and report coverage and detection stats for
                        each gene. Brace yourself for a huge file for large
                        contigs databases lol :( (default: False)
  --gene-caller GENE-CALLER
                        The gene caller to utilize. Anvi'o supports multiple
                        gene callers, and some operations (including this one)
                        require an explicit mention of which one to use.
                        The default is prodigal, but it will not be enough if
                        you were experiencing your rebelhood as you should,
                        and have generated your contigs database with
                        `--external-gene-callers` or something. Also, some HMM
                        collections may add new gene calls into a given
                        contigs database as an ad-hoc fashion, so if you want
                        to see all the options available to you in a given
                        contigs database, please run the program `anvi-db-
                        info` and take a look at the output. (default:
                        prodigal)

OUTPUT: How do you want to store your output data.

  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
  --report-minimal      Using this flag, you can ask anvi'o to report minimum
                        amount of data about your genes or contigs (such as
                        mean coverage and detection) rather than a full blown
                        output file with as much information as anvi'o can
                        offer (such as, mean coverage, detection, Q2Q3
                        coverage, standard deviation of coverage, min/max
                        values of coverage, GC-content and length of items,
                        etc). Using this flag can cut your processing time in
                        half. See the help docs for example output files for
                        contigs and gene mode. (default: False)
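
The two statistics kept by `--report-minimal`, mean coverage and detection, can be illustrated with a short Python sketch. The help text does not define detection, so the definition used here (the proportion of positions in an item covered by at least one read) is an assumption, as are the function names and the toy coverage values.

```python
def mean_coverage(coverages):
    """Average number of reads covering each position of an item."""
    if not coverages:
        raise ValueError("coverages must not be empty")
    return sum(coverages) / len(coverages)


def detection(coverages):
    """Assumed definition: fraction of positions with coverage of at least 1X."""
    if not coverages:
        raise ValueError("coverages must not be empty")
    return sum(1 for cov in coverages if cov > 0) / len(coverages)


if __name__ == "__main__":
    # Hypothetical per-nucleotide coverage of a five-nucleotide item.
    item = [0, 0, 4, 6, 10]
    print(mean_coverage(item), detection(item))
```

An item can have a high mean coverage while only a small part of it is covered, which is why looking at both numbers together is useful.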

anvi-push

Push stuff to an anvi'server

Usage

usage: anvi-push [-h] --user USERNAME [--api-url API_URL] -n PROJECT_NAME
                 [-t NEWICK] [--items-order FLAT_FILE] [-f FASTA file]
                 [-d VIEW_DATA] [-A ADDITIONAL_LAYERS] [-s STATE]
                 [--description TEXT_FILE] [--bins BINS_DATA]
                 [--bins-info BINS_INFO] [--delete-if-exists]

Parameters

SERVER DETAILS: Details of how to access to an anvi'server instance.

  --user USERNAME       The user for an anvi'server. (default: None)
  --api-url API_URL     Anvi'server url (default: https://anvi-server.org)

PROJECT DETAILS: What to send to the server

  -n PROJECT_NAME, --project-name PROJECT_NAME
                        Name of the project. Please choose a short but
                        descriptive name (so anvi'o can use it whenever she
                        needs to name an output file, or add a new table in a
                        database, or name her first born). (default: None)
  -t NEWICK, --tree NEWICK
                        NEWICK formatted tree structure (default: None)
  --items-order FLAT_FILE
                        A flat file that contains the order of items you wish
                        to display using the interactive interface. You may
                        want to use this if you have a specific order of items
                        in your mind, and do not want to display a tree in the
                        middle (or simply you don't have one). The file format
                        is simple: each line should have an item name, and
                        there should be no header. (default: None)
  -f FASTA file, --fasta-file FASTA file
                        A FASTA-formatted input file. (default: None)
  -d VIEW_DATA, --view-data VIEW_DATA
                        A TAB-delimited file for view data (default: None)
  -A ADDITIONAL_LAYERS, --additional-layers ADDITIONAL_LAYERS
                        A TAB-delimited file for additional layers for splits.
                        The first column of this file must be split names, and
                        the remaining columns should be unique attributes. The
                        file does not need to contain all split names, or
                        values for each split in every column. Anvi'o will try
                        to deal with missing data nicely. Each column in this
                        file will be visualized as a new layer in the tree.
                        (default: None)
  -s STATE, --state STATE
                        State file. You can export states from a database
                        using the anvi-export-state program. (default: None)
  --description TEXT_FILE
                        A plain text file that contains some description about
                        the project. You can use Markdown syntax. The
                        description text will be rendered and shown in all
                        relevant interfaces, including the anvi'o interactive
                        interface, or anvi'o summary outputs. (default: None)
  --bins BINS_DATA      Tab-delimited file, first column contains tree leaves
                        (gene clusters, splits, contigs etc.) and second
                        column contains which Bin they belong. (default: None)
  --bins-info BINS_INFO
                        Additional information for bins. The file must contain
                        three TAB-delimited columns, where the first one must
                        be a unique bin name, the second should be a 'source',
                        and the last one should be a 7 character HTML color
                        code (i.e., '#424242'). Source column must contain
                        information about the origin of the bin. If these bins
                        are automatically identified by a program like
                        CONCOCT, this column could contain the program name
                        and version. The source information will be associated
                        with the bin in various interfaces so in a sense it is
                        not *that* critical what it says there, but on the
                        other hand it is, because we should also think about
                        people who may end up having to work with what we put
                        together later. (default: None)

RISKY CLICKS: As the name suggests!

  --delete-if-exists    Be bold (at your own risk), and delete if exists.
                        (default: False)
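
Before pushing, it may help to check that a `--bins-info` file follows the format described above: three TAB-delimited columns, a unique bin name, a source, and a 7 character HTML color code. The Python sketch below is a hypothetical helper, not part of anvi'o, and it assumes the file has no header line, which the help text does not state.

```python
import re

# A 7 character HTML color code: '#' followed by six hexadecimal digits.
HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")


def validate_bins_info(text):
    """Return a list of problems found in bins-info text (empty if none)."""
    problems = []
    seen = set()
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            problems.append(
                f"line {line_no}: expected 3 TAB-delimited columns, found {len(fields)}")
            continue
        bin_name, _source, color = fields
        if bin_name in seen:
            problems.append(f"line {line_no}: bin name '{bin_name}' is not unique")
        seen.add(bin_name)
        if not HEX_COLOR.fullmatch(color):
            problems.append(f"line {line_no}: '{color}' is not a color like '#424242'")
    return problems


if __name__ == "__main__":
    # Hypothetical content; the second line repeats a bin name and has a bad color.
    example = "Bin_1\tCONCOCT\t#424242\nBin_1\tCONCOCT\tred\n"
    for problem in validate_bins_info(example):
        print(problem)
```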

anvi-refine

Start an anvi'o interactive interface to manually curate or refine a genome, whether it is a metagenome-assembled, single-cell, or an isolate genome

Example uses and other resources

Usage

usage: anvi-refine [-h] -p PROFILE_DB -c CONTIGS_DB [-C COLLECTION_NAME]
                   [-b BIN_NAME] [-B FILE_PATH]
                   [--find-from-split-name SPLIT_NAME] [-t NEWICK]
                   [--skip-hierarchical-clustering] [--load-full-state]
                   [-V ADDITIONAL_VIEW] [-A ADDITIONAL_LAYERS]
                   [-F FUNCTION ANNOTATION SOURCE] [--show-all-layers]
                   [--split-hmm-layers]
                   [--taxonomic-level {t_domain,t_phylum,t_class,t_order,t_family,t_genus,t_species}]
                   [--hide-outlier-SNVs] [--title NAME]
                   [--export-svg FILE_PATH] [--dry-run]
                   [--skip-init-functions] [--skip-news]
                   [--skip-auto-ordering] [-I IP_ADDR] [-P INT]
                   [--browser-path PATH] [--read-only] [--server-only]
                   [--password-protected]

Parameters

DEFAULT INPUTS: The interactive interface can be started with and without anvi'o databases. The default use assumes you have your profile and contigs database, however, it is also possible to start the interface using ad-hoc input files. See 'MANUAL INPUT' section for other set of parameters that are mutually exclusive with databases.

  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)

REFINE-SPECIFICS: Parameters that are essential to the refinement process.

  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name. (default: None)
  -b BIN_NAME, --bin-id BIN_NAME
                        Bin name you are interested in. (default: None)
  -B FILE_PATH, --bin-ids-file FILE_PATH
                        Text file for bins (each line should be a unique bin
                        id). (default: None)
  --find-from-split-name SPLIT_NAME
                        If you don't know the bin name you want to work with
                        but if you know the split name it contains you can use
                        this parameter to tell anvi'o the split name, and so
                        it can find the bin for you automatically. This is
                        something extremely difficult for anvi'o to do, but it
                        does it anyway because you. (default: None)

ADDITIONAL STUFF: Parameters to provide additional layers, views, or layer data.

  -t NEWICK, --tree NEWICK
                        NEWICK formatted tree structure (default: None)
  --skip-hierarchical-clustering
                        Skip hierarchical clustering for the splits in the
                        refined bin, if you skip clustering you need to
                        provide your own newick formatted tree using --tree
                        parameter. (default: False)
  --load-full-state     Often the minimum and maximum values defined for an
                        entire profile database that contains all contigs
                        do not scale well when you wish to work with a single
                        bin in the refine mode. For this reason, the default
                        behavior of anvi-refine is to ignore min/max values
                        set in the default state. This flag is your way of
                        telling anvi'o to not do that, and load the state
                        stored in the profile database as is. Please note that
                        this variable has no influence on the `detection`
                        view. For the `detection` view, anvi'o will always
                        load the global detection settings as if you have used
                        this flag. (default: False)
  -V ADDITIONAL_VIEW, --additional-view ADDITIONAL_VIEW
                        A TAB-delimited file for an additional view to be used
                        in the interface. This file should contain all split
                        names, and values for each of them in all samples.
                        Each column in this file must correspond to a sample
                        name. Content of this file will be called 'user_view',
                        which will be available as a new item in the 'views'
                        combo box in the interface (default: None)
  -A ADDITIONAL_LAYERS, --additional-layers ADDITIONAL_LAYERS
                        A TAB-delimited file for additional layers for splits.
                        The first column of this file must be split names, and
                        the remaining columns should be unique attributes. The
                        file does not need to contain all split names, or
                        values for each split in every column. Anvi'o will try
                        to deal with missing data nicely. Each column in this
                        file will be visualized as a new layer in the tree.
                        (default: None)
  -F FUNCTION ANNOTATION SOURCE, --annotation-source-for-per-split-summary FUNCTION ANNOTATION SOURCE
                        Using this parameter with a functional annotation
                        source that (1) is in the contigs database and (2) has
                        a maximum of 10 different function names, will
                        dynamically add a new layer to the interactive
                        interface where proportions of functions in that
                        source will be shown per split as stacked bar charts.
                        (default: None)

VISUALS RELATED: Parameters that give access to various adjustments regarding the interface.

  --show-all-layers     When declared, this flag tells the interface to show
                        every additional layer even if there are no hits. By
                        default, anvi'o doesn't show layers if there are no
                        hits for any of your items. (default: False)
  --split-hmm-layers    When declared, this flag tells the interface to split
                        every gene found in HMM searches that were performed
                        against non-singlecopy gene HMM profiles into their
                        own layer. Please see the documentation for details.
                        (default: False)
  --taxonomic-level {t_domain,t_phylum,t_class,t_order,t_family,t_genus,t_species}
                        The taxonomic level to use whenever relevant and/or
                        available. The default taxonomic level is t_genus, but
                        if you choose something specific, anvi'o will focus on
                        that whenever possible.
  --hide-outlier-SNVs   During profiling, anvi'o marks positions of single-
                        nucleotide variations (SNVs) that originate from
                        places in contigs where coverage values are a bit
                        'sketchy'. If you would like to avoid SNVs in those
                        positions of splits in applicable projects you can use
                        this flag, and the interface would hide SNVs that are
                        marked as 'outlier' (although it is clearly the best
                        to see everything, no one will judge you if you end up
                        using this flag) (plus, there may or may not be some
                        historical data on this here:
                        https://github.com/meren/anvio/issues/309). (default:
                        False)
  --title NAME          Title for the interface. If you are working with a
                        RUNINFO dict, the title will be determined based on
                        information stored in that file. Regardless, you can
                        override that value using this parameter. (default:
                        None)
  --export-svg FILE_PATH
                        The SVG output file path. (default: None)

SWEET PARAMS OF CONVENIENCE: Parameters and flags that are not quite essential (but nice to have).

  --dry-run             Don't do anything real. Test everything, and stop
                        right before wherever the developer said 'well, this
                        is enough testing', and decided to print out results.
                        (default: False)
  --skip-init-functions
                        When declared, function calls for genes will not be
                        initialized (therefore will be missing from all
                        relevant interfaces or output files). The use of this
                        flag may reduce the memory fingerprint and processing
                        time for large datasets. (default: False)
  --skip-news           Don't try to read news content from upstream.
                        (default: False)
  --skip-auto-ordering  When declared, the attempt to include automatically
                        generated orders of items based on additional data is
                        skipped. In case those buggers cause issues with your
                        data, and you still want to see your stuff and deal
                        with the other issue maybe later. (default: False)

SERVER CONFIGURATION: For power users.

  -I IP_ADDR, --ip-address IP_ADDR
                        IP address for the HTTP server. The default ip address
                        (0.0.0.0) should work just fine for most.
  -P INT, --port-number INT
                        Port number to use for anvi'o services. If nothing is
                        declared, anvi'o will try to find a suitable port
                        number, starting from the default port number, 8080.
                        (default: None)
  --browser-path PATH   By default, anvi'o will use your default browser to
                        launch the interactive interface. If you would like to
                        use something else than your system default, you can
                        provide a full path for an alternative browser using
                        this parameter, and hope for the best. For instance we
                        are using this parameter to call Google's experimental
                        browser, Canary, which performs better with demanding
                        visualizations. (default: None)
  --read-only           When the interactive interface is started with this
                        flag, all 'database write' operations will be
                        disabled. (default: False)
  --server-only         The default behavior is to start the local server, and
                        fire up a browser that connects to the server. If you
                        have other plans, and want to start the server without
                        calling the browser, this is the flag you need.
                        (default: False)
  --password-protected  If this flag is set, the command line tool will ask
                        you to enter a password, and the interactive interface
                        will only be accessible after entering the same
                        password. This option is recommended for shared
                        machines like clusters or shared networks where
                        computers are not isolated. (default: False)

anvi-rename-bins

Rename all bins in a given collection (so they have pretty names)

Usage

usage: anvi-rename-bins [-h] -c CONTIGS_DB -p PROFILE_DB
                        [--collection-to-read COLLECTION_TO_READ]
                        [--collection-to-write COLLECTION_TO_WRITE]
                        [--prefix PREFIX] [--report-file REPORT_FILE_PATH]
                        [--list-collections] [--dry-run] [--call-MAGs]
                        [--min-completion-for-MAG MIN_COMPLETION_FOR_MAG]
                        [--max-redundancy-for-MAG MAX_REDUNDANCY_FOR_MAG]
                        [--size-for-MAG MEGABASEPAIRS]

Parameters

DEFAULT INPUTS: Standard stuff

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  --collection-to-read COLLECTION_TO_READ
                        Collection name to read from. Anvi'o will not
                        overwrite an existing collection, instead, it will
                        create a copy of your collection with new bin names.
                        (default: None)
  --collection-to-write COLLECTION_TO_WRITE
                        The new collection name. Give it a nice, fancy name.
                        (default: None)

OUTPUT AND TESTING: a.k.a, sweet parameters of convenience

  --prefix PREFIX       Prefix for the bin names. Must be a single word,
                        composed of letters and digits. The use of the
                        underscore character is OK, but that's about it (fine,
                        the use of the dash character is OK, too, but no
                        more!). If the prefix is 'PREFIX', each bin will be
                        renamed as 'PREFIX_XXX_00001', 'PREFIX_XXX_00002', and
                        so on, in the order of percent completion minus
                        percent redundancy (what we call 'substantive
                        completion'). The 'XXX' part will either be 'Bin' or
                        'MAG', depending on other parameters you use. Keep
                        reading. (default: None)
  --report-file REPORT_FILE_PATH
                        This file will report each name change event, so you
                        can trace back the original names of renamed bins
                        later. (default: None)
  --list-collections    Show available collections and exit. (default: False)
  --dry-run             When used does NOT update the profile database, just
                        creates the report file so you can view how things
                        will be renamed. (default: False)

MAG OPTIONS: If you want to call some bins 'MAGs' because you are so cool

  --call-MAGs           This program by default renames your bins as
                        'PREFIX_Bin_00001', 'PREFIX_Bin_00002' and so on. If
                        you use this flag, it will name the ones that meet the
                        criteria described by MAG-related flags as
                        'PREFIX_MAG_00001', 'PREFIX_MAG_00002', and so on. The
                        ones that do not get to be named as MAGs will remain
                        as bins. (default: False)
  --min-completion-for-MAG MIN_COMPLETION_FOR_MAG
                        If --call-MAGs flag is used, call any bin a 'MAG' if
                        its completion estimate is above this (the default
                        is 70), and the redundancy estimate is less than
                        --max-redundancy-for-MAG.
  --max-redundancy-for-MAG MAX_REDUNDANCY_FOR_MAG
                        If --call-MAGs flag is used, call any bin a 'MAG' if
                        its redundancy estimate is below this (the default
                        is 10) and the completion estimate is above
                        --min-completion-for-MAG.
  --size-for-MAG MEGABASEPAIRS
                        If --call-MAGs flag is used, call any bin a 'MAG' if
                        its redundancy estimate is less than
                        --max-redundancy-for-MAG, AND ITS SIZE IS LARGER THAN
                        THIS VALUE REGARDLESS OF THE COMPLETION ESTIMATE. The
                        default behavior is to not care about this at all.
                        (default: 0.0)
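
The naming rules described above can be sketched in a few lines of Python. This is only an illustration of the help text, not the program's actual code: the function names, the input dictionary, and the choice of a single counter shared by MAGs and bins are all assumptions made for this example.

```python
import re


def is_valid_prefix(prefix):
    # Letters, digits, underscore, and dash only, per the --prefix help text.
    return re.fullmatch(r"[A-Za-z0-9_-]+", prefix) is not None


def rename_bins(bins, prefix, call_mags=False, min_completion=70.0,
                max_redundancy=10.0, size_for_mag=0.0):
    """bins: dict of old name -> (completion %, redundancy %, size in Mbp)."""
    if not is_valid_prefix(prefix):
        raise ValueError(f"bad prefix: {prefix!r}")
    # Order by 'substantive completion' (completion minus redundancy), best first.
    ordered = sorted(bins, key=lambda b: bins[b][0] - bins[b][1], reverse=True)
    renamed = {}
    for counter, name in enumerate(ordered, start=1):
        completion, redundancy, size = bins[name]
        # The size rule only applies when a non-zero size was requested.
        is_mag = call_mags and redundancy < max_redundancy and (
            completion > min_completion
            or (size_for_mag > 0 and size > size_for_mag))
        label = "MAG" if is_mag else "Bin"
        # ASSUMPTION: one counter shared by MAGs and bins.
        renamed[name] = f"{prefix}_{label}_{counter:05d}"
    return renamed
```

Calling `rename_bins(bins, "PREFIX")` without `call_mags=True` names everything as a bin, which mirrors the default behavior described for --call-MAGs.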

anvi-report-inversions

Reports inversions

Usage

usage: anvi-report-inversions [-h] --my-name-is YOUR NAME [-P FILE_PATH]
                              [--min-coverage-to-define-stretches INT]
                              [--min-stretch-length INT]
                              [--min-distance-between-independent-stretches INT]
                              [--num-nts-to-pad-a-stretch INT] [-l INT]
                              [-m INT] [-d INT]
                              [--process-only-inverted-reads] [--verbose]
                              [--only-report-from REGION_STRING]

Parameters

WHO ARE YOU?: Are you a part of our very secret group? FIND OUT HERE!.

  --my-name-is YOUR NAME
                        What is your name? (default: anyone)

INPUT DATA: Essentially a BAMs and profiles file and nothing more.

  -P FILE_PATH, --bams-and-profiles FILE_PATH
                        A four-column TAB-delimited flat text file. The header
                        line must contain these columns: 'name',
                        'contigs_db_path', 'profile_db_path', and
                        'bam_file_path'. See the profiles-and-bams.txt
                        artifact for the details of the file. (default: None)
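
As a rough illustration of the file layout described above, the following Python sketch reads a TAB-delimited file and checks that the four required header columns are present. The function name is hypothetical and not part of anvi'o, and the sketch does not check that the paths exist.

```python
import csv
import io

REQUIRED_COLUMNS = ["name", "contigs_db_path", "profile_db_path", "bam_file_path"]


def read_bams_and_profiles(handle):
    """Return one dict per data row, after checking the header columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError("missing column(s): " + ", ".join(missing))
    return list(reader)


# Hypothetical sample names and paths, for illustration only.
sample = (
    "name\tcontigs_db_path\tprofile_db_path\tbam_file_path\n"
    "S01\tCONTIGS.db\tS01/PROFILE.db\tS01.bam\n"
)
rows = read_bams_and_profiles(io.StringIO(sample))
```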

KEY ALGORITHMIC COMPONENT 01: IDENTIFYING REGIONS OF INTEREST: How should anvi'o identify regions of interest based on REV/REV and FWD/FWD paired-end reads? Defaults will be good for most cases.

  --min-coverage-to-define-stretches INT
                        Value to break up contigs into 'stretches' of high-
                        coverage regions of FWD/FWD and REV/REV reads. The
                        lower the value, the more noise. This acts as a low-
                        pass filter if it helps you imagine how it works.
                        (default: 10)
  --min-stretch-length INT
                        These are not the stretches you are looking for
                        (unless they are longer than this, obv). (default: 50)
  --min-distance-between-independent-stretches INT
                        Our 'low pass' filter may break a single stretch of
                        reasonable coverage of FWD/FWD and REV/REV reads into
                        multiple pieces. To recover from that, we wish to
                        merge the fragmented ones if they are closer to one
                        another than this value. (default: 2000)
  --num-nts-to-pad-a-stretch INT
                        Some leeway towards upstream and downstream context
                        that is essential to not miss key information due to
                        coverage variation that may influence the beginnings
                        and ends of final stretches. (default: 100)
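
To make the interplay of these four parameters easier to picture, here is a minimal Python sketch that works on a plain list of per-nucleotide coverage values. It is not anvi'o's implementation: the function name, the order of the steps (merge first, then filter by length, then pad), and the exact comparison operators are assumptions for illustration.

```python
def find_stretches(coverage, min_coverage=10, min_length=50,
                   merge_distance=2000, pad=100):
    """Return (start, stop) pairs, 0-based and half-open, for padded stretches."""
    # 1. Collect runs of positions whose coverage reaches the threshold.
    runs, start = [], None
    for pos, cov in enumerate(coverage):
        if cov >= min_coverage and start is None:
            start = pos
        elif cov < min_coverage and start is not None:
            runs.append((start, pos))
            start = None
    if start is not None:
        runs.append((start, len(coverage)))

    # 2. Merge runs that are closer to one another than merge_distance.
    merged = []
    for s, e in runs:
        if merged and s - merged[-1][1] < merge_distance:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))

    # 3. Keep only stretches longer than min_length, then
    # 4. pad both ends without running past the contig boundaries.
    return [(max(0, s - pad), min(len(coverage), e + pad))
            for s, e in merged if e - s > min_length]
```

With the defaults, two nearby runs of 60 and 40 covered nucleotides separated by a 30 nt gap are reported as one padded stretch, while an isolated 10 nt run is dropped.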

KEY ALGORITHMIC COMPONENT 02: FINDING PALINDROMES: Some essential parameters to find palindromes in sequence stretches anvi'o identified in the previous step

  -l INT, --min-palindrome-length INT
                        The minimum palindrome length. (default: 10)
  -m INT, --max-num-mismatches INT
                        The maximum number of mismatches allowed. (default: 0)
  -d INT, --min-distance INT
                        The minimum distance between the palindromic sequences
                        (this parameter is essentially asking for the number
                        of `x` in the sequence `ATCGxxxCGAT`). The default is
                        50, which means the algorithm will never report by
                        default sequences that are like `ATCGCGAT` with no
                        gaps between the palindrome where the palindromic
                        sequence matches itself (but you can get such
                        palindromes by setting this parameter to 0). (default:
                        50)
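
The role of the minimum distance can be illustrated with a small brute-force search in Python. This sketch is deliberately simplified and is not the algorithm anvi'o uses: it looks for arms of one fixed length rather than a minimum length, and the function names are made up for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_palindromes(seq, length, max_mismatches=0, min_distance=50):
    """Return (left_start, right_start, distance) for each inverted repeat.

    'distance' is the number of nucleotides between the two arms, i.e. the
    number of `x` in `ATCGxxxCGAT`.
    """
    hits = []
    for i in range(len(seq) - 2 * length - min_distance + 1):
        target = reverse_complement(seq[i:i + length])
        for j in range(i + length + min_distance, len(seq) - length + 1):
            mismatches = sum(a != b for a, b in zip(target, seq[j:j + length]))
            if mismatches <= max_mismatches:
                hits.append((i, j, j - (i + length)))
    return hits
```

For example, `find_palindromes("ATCGCGAT", 4, min_distance=0)` finds the gapless palindrome, while the same call with the default distance of 50 returns an empty list.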

KEY ALGORITHMIC COMPONENT 03: CONFIRMING INVERSIONS: Which palindromes are inversions? A one million dollar question that is quite difficult to get right (but anvi'o does get it right frequently).

  --process-only-inverted-reads
                        At one point, anvi'o will have all the regions of
                        interest in contigs that include palindromes that look
                        promising. At that point, it will access short
                        reads in the BAM file to determine which palindromes
                        in fact represent active inversions by searching for
                        unique constructs that can only occur when a genomic
                        region did funny things. One option is to search for
                        such reads that are evidence of inversion activity
                        among paired-end reads that are in FWD/FWD or REV/REV
                        orientation, which is defined by the fetch-filter
                        'inversions'. However, this may be too limiting as
                        there may be paired-end reads that are too downstream
                        or too upstream to the region of interest, and thus
                        mapping FWD/REV or REV/FWD orientation just like every
                        other non-rebellious paired end read, YET including
                        one of the constructs. Thus, searching all reads may
                        in fact help the identification of more inversions
                        (especially those that are covered less), with only an
                        added disadvantage of compute time, which should be
                        negligible in almost all instances (since anvi'o at
                        this stage is only focusing on very specific regions
                        of genomes). But this parameter is here in case you
                        insist on only using inverted paired-end reads and
                        assert your authority. You do you and turn on the
                        flag, you rebellious scientist who will likely miss a
                        lot of additional inversions like a boss. (default:
                        False)

OTHER PARAMETERS OF CONVENIENCE: For the never satisfied.

  --verbose             Be verbose, print more messages whenever possible. You
                        may regret this. (default: False)
  --only-report-from REGION_STRING
                        This is a debugging flag more than anything, and can
                        be helpful to work on problematic cases over and over
                        again. To engage this mode, you need to specify a
                        region mentioned in your output file or in the
                        printouts you see when you use the `--verbose` flag.
                        It will be different from project to project, but it
                        should look something like this:
                        CONTIG_NAME_5364768_5365139. When declared, anvi'o
                        will only report data from this particular region, and
                        nothing at all if it does not find it while
                        processing palindromes. (default: None)

anvi-report-linkmers

Reports sequences stored in one or more BAM files that cover one or more specific nucleotide positions in a reference

Usage

usage: anvi-report-linkmers [-h] -i INPUT_BAM(S) [INPUT_BAM(S) ...]
                            --contigs-and-positions CONTIGS_AND_POS
                            [--only-complete-links] -o FILE_PATH
                            [--list-contigs]

Parameters

optional arguments:

  -i INPUT_BAM(S) [INPUT_BAM(S) ...], --input-files INPUT_BAM(S) [INPUT_BAM(S) ...]
                        Sorted and indexed BAM files to analyze. It is
                        essential that all BAM files must be the result of
                        mappings against the same contigs. (default: None)
  --contigs-and-positions CONTIGS_AND_POS
                        This is the file where you list the contigs, and
                        nucleotide positions you are interested in. This is
                        supposed to be a TAB-delimited file with two columns.
                        In each line, the first column should be the contig
                        name, and the second column should be the comma-
                        separated list of integers for nucleotide positions.
                        (default: None)
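
The two-column layout described for --contigs-and-positions can be parsed along these lines. This Python sketch only illustrates the file format; the function name and the contig names in the usage line are hypothetical.

```python
def parse_contigs_and_positions(text):
    """Map each contig name to its list of nucleotide positions."""
    wanted = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {line_number}: expected 2 TAB-delimited columns")
        contig, positions = fields
        # Second column: comma-separated integers.
        wanted[contig] = [int(p) for p in positions.split(",")]
    return wanted


# Hypothetical contig names, for illustration only.
example = parse_contigs_and_positions("contig_1\t10,25,40\ncontig_2\t7\n")
```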
  --only-complete-links
                        When declared, only reads that cover all positions
                        will be reported. It is necessary to use this flag if
                        you want to perform oligotyping-like analyses on
                        matching reads. (default: False)
  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
  --list-contigs        When declared, the program will list contigs in the
                        BAM file and exit gracefully without any further
                        analysis. (default: False)

anvi-run-hmms

This program deals with populating tables that store HMM hits in an anvi'o contigs database

Example uses and other resources

Usage

usage: anvi-run-hmms [-h] -c CONTIGS_DB [-H HMM PROFILE PATH]
                     [-I HMM PROFILE NAMES]
                     [--hmmer-output-dir OUTPUT DIRECTORY PATH]
                     [--domain-hits-table] [--also-scan-trnas]
                     [-T NUM_THREADS] [--hmmer-program HMMER_PROGRAM]
                     [--just-do-it]

Parameters

DB: An anvi'o contigs database to populate with HMM hits

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)

HMM OPTIONS: If you have your own HMMs, or if you would like to run only a set of default anvi'o HMM profiles rather than running them all, this is your stop.

  -H HMM PROFILE PATH, --hmm-profile-dir HMM PROFILE PATH
                        With this parameter you can specify a directory path
                        that contains an HMM profile. This way you can run
                        HMM profiles that are not included in anvi'o. See the
                        online documentation to find out about the specifics
                        of this directory structure. (default: None)
  -I HMM PROFILE NAME(S), --installed-hmm-profile HMM PROFILE NAME(S)
                        When you run this program without any parameter, it
                        will run all 9 HMM profiles installed on your system.
                        Using this parameter, you can instruct anvi'o to run
                        only one or more of the specific profiles of your
                        choice. You can provide a comma-separated list of
                        names for multiple profiles (but in that case don't
                        put a space between each profile name). Here is the
                        list of installed profiles currently available to
                        you: 'Bacteria_71' (type:
                        singlecopy); 'Archaea_76' (type: singlecopy);
                        'Ribosomal_RNA_23S' (type: Ribosomal_RNA_23S);
                        'Ribosomal_RNA_28S' (type: Ribosomal_RNA_28S);
                        'Ribosomal_RNA_5S' (type: Ribosomal_RNA_5S);
                        'Ribosomal_RNA_16S' (type: Ribosomal_RNA_16S);
                        'Ribosomal_RNA_12S' (type: Ribosomal_RNA_12S);
                        'Protista_83' (type: singlecopy); 'Ribosomal_RNA_18S'
                        (type: Ribosomal_RNA_18S). (default: None)
  --hmmer-output-dir OUTPUT DIRECTORY PATH
                        If you provide a path with this parameter, then the
                        HMMER output file(s) will be saved in this directory.
                        Please note that this will only work if you are
                        running on only one profile using the -I flag.
                        (default: None)
  --domain-hits-table   Use this flag in conjunction with --hmmer-output-dir
                        to request domain table output from HMMER (i.e., the
                        file specified by the --domtblout flag from hmmsearch
                        or hmmscan). Otherwise, only the regular --tblout file
                        will be stored in the specified directory. Please note
                        that even if you use this flag, the HMM hits stored in
                        the database will be taken from the --tblout file
                        only. Also, this option only works with HMM profiles
                        for amino acid sequences (not nucleotides). (default:
                        False)

tRNAs: Through this program you can also scan Transfer RNA sequences in your contigs database for free (instead of running anvi-scan-trnas later).

  --also-scan-trnas     Also scan tRNAs while you're at it. (default: False)

PERFORMANCE: Stuff everyone forgets to set and then get upset with how slow science goes.

  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: 1)
  --hmmer-program HMMER_PROGRAM
                        Which of the HMMER programs to use to run HMMs
                        (hmmscan or hmmsearch). By default anvi'o will use
                        hmmscan for typical HMM operations like those in anvi-
                        run-hmms (as these tend to scan a very large number of
                        genes against a relatively small number of HMMs), but
                        if you are using this program to scan a very large
                        number of HMMs, hmmsearch might be a better choice for
                        performance. For this reason, hmmsearch is the default
                        in operations like anvi-run-pfams and anvi-run-kegg-
                        kofams. See this article for a discussion on the
                        performance of these two programs:
                        https://cryptogenomicon.org/2011/05/27/hmmscan-vs-
                        hmmsearch-speed-the-numerology/ (default: None)

AUTHORITY: Because you are the boss.

  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)

anvi-run-interacdome

Run InteracDome on a contigs database

Example uses and other resources

Usage

usage: anvi-run-interacdome [-h] -c CONTIGS_DB [--interacdome-data-dir PATH]
                            [--interacdome-dataset {representable,confident}]
                            [-m FLOAT] [-f FLOAT] [-t FLOAT] [-T NUM_THREADS]
                            [--just-do-it] [-O FILENAME_PREFIX]

Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  --interacdome-data-dir PATH
                        The path for the interacdome data to be stored. If you
                        leave it as is without specifying anything, anvi'o
                        will set up everything in a pre-defined default
                        directory. The advantage of using the default
                        directory at the time of set up is that every user of
                        anvi'o on a computer system will be using a single
                        data directory, but then you may need to run the setup
                        program with superuser privileges. If you don't have
                        superuser privileges, then you can use this parameter
                        to tell anvi'o the location you wish to use to setup
                        your data. (default: None)
  --interacdome-dataset {representable,confident}
                        Choose 'representable' to include Pfams that
                        correspond to domain-ligand interactions that had
                        nonredundant instances across three or more distinct
                        PDB structures. InteracDome authors recommend using
                        this collection to learn more about domain binding
                        properties. Choose 'confident' to include Pfams that
                        correspond to domain-ligand interactions that had
                        nonredundant instances across three or more distinct
                        PDB entries and achieved a cross-validated precision
                        of at least 0.5. We recommend using this collection to
                        annotate potential ligand-binding positions in protein
                        sequences. The default is 'representable'.
  -m FLOAT, --min-binding-frequency FLOAT
                        InteracDome has associated binding 'frequencies',
                        which can be considered scores between 0 and 1 that
                        quantify how likely a position is to be involved in
                        binding. Use this parameter to filter out low
                        frequencies. The default is 0.200000. Warning, your
                        contigs database size will grow massively if this is
                        set to 0.0, but you're the boss.
  -f FLOAT, --min-hit-fraction FLOAT
                        Any hit whose length (relative to the HMM profile)
                        divided by the total HMM profile length is less than
                        this value will be removed from the results and will
                        not contribute to binding frequencies. The default is
                        0.5.
  -t FLOAT, --information-content-cutoff FLOAT
                        This parameter can be used to control for low-quality
                        domain hits. Each domain is composed of positions
                        (match states) with varying degrees of conservancy,
                        which can be quantified with information content (IC).
                        High IC means highly conserved. For example, IC = 4
                        corresponds to 95% of the members of the Pfam sharing
                        the same amino acid at that position. By default,
                        anvi'o demands that for an alignment of a user's gene
                        with a Pfam HMM, the gene sequence must match with the
                        consensus amino acid of each match state that has IC >
                        4.000000. For context, it is common for a Pfam to not
                        even have a position with an IC > 4, so these
                        represent truly very conserved positions. You can
                        modify this with this parameter. For example, if you
                        think this is dumb, you can set this to 10000, and
                        then no domain hits will be removed for this reason.
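
The two hit-quality filters described above (--min-hit-fraction and --information-content-cutoff) can be written as two small predicates. This is a hedged Python sketch based only on the help text: the function names, the 0-based half-open HMM coordinates, and the tuple layout for match states are assumptions, and the information content values are taken as given rather than computed.

```python
def passes_hit_fraction(hmm_start, hmm_stop, hmm_length, min_hit_fraction=0.5):
    # Hit length is measured along the HMM profile, not along the gene.
    # ASSUMPTION: hmm_start/hmm_stop are 0-based, half-open coordinates.
    return (hmm_stop - hmm_start) / hmm_length >= min_hit_fraction


def passes_ic_filter(match_states, ic_cutoff=4.0):
    """match_states: iterable of (information_content, consensus_aa, gene_aa).

    Every match state with IC above the cutoff must carry the consensus
    amino acid in the gene; less conserved positions are ignored.
    """
    return all(gene_aa == consensus_aa
               for ic, consensus_aa, gene_aa in match_states
               if ic > ic_cutoff)
```

Setting `ic_cutoff` to a very large value such as 10000 leaves no position above the cutoff, so no hit is rejected by this filter, which matches the behavior described for -t.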
  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: 1)
  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)
  -O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
                        A prefix to be used while naming the output files (no
                        file type extensions please; just a prefix). (default:
                        INTERACDOME)

anvi-run-kegg-kofams

Run KOfam HMMs on an anvi'o contigs database

Usage

usage: anvi-run-kegg-kofams [-h] -c CONTIGS_DB [--kegg-data-dir KEGG_DATA_DIR]
                            [-T NUM_THREADS] [--hmmer-program HMMER_PROGRAM]
                            [--keep-all-hits] [--log-bitscores] [--just-do-it]
                            [--skip-bitscore-heuristic] [-E FLOAT] [-H FLOAT]

Parameters

REQUIRED INPUT: The stuff you need for this to work.

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)

OPTIONAL INPUT: Optional params for a custom experience.

  --kegg-data-dir KEGG_DATA_DIR
                        The directory path for your KEGG setup, which will
                        include things like KOfam profiles and KEGG MODULE
                        data. Anvi'o will try to use the default path if you
                        do not specify anything. (default: None)
  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: 1)
  --hmmer-program HMMER_PROGRAM
                        Which of the HMMER programs to use to run HMMs
                        (hmmscan or hmmsearch). By default anvi'o will use
                        hmmscan for typical HMM operations like those in anvi-
                        run-hmms (as these tend to scan a very large number of
                        genes against a relatively small number of HMMs), but
                        if you are using this program to scan a very large
                        number of HMMs, hmmsearch might be a better choice for
                        performance. For this reason, hmmsearch is the default
                        in operations like anvi-run-pfams and anvi-run-kegg-
                        kofams. See this article for a discussion on the
                        performance of these two programs:
                        https://cryptogenomicon.org/2011/05/27/hmmscan-vs-
                        hmmsearch-speed-the-numerology/ (default: None)
  --keep-all-hits       If you use this flag, anvi'o will not get rid of any
                        raw HMM hits, even those that are below the score
                        threshold. (default: False)
  --log-bitscores       Use this flag to generate a tab-delimited text file
                        containing the bit scores of every KOfam hit that is
                        put in the contigs database. (default: False)
  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)

BITSCORE RELAXATION HEURISTIC: Sometimes, KEGG-provided bitscore thresholds are too stringent, causing us to miss valid annotations. We apply the following heuristic to relax those thresholds and annotate genes that would otherwise miss a valid annotation: for every gene call that is not annotated, examine hits to that gene with an evalue <= X and a bitscore > Y * KEGG's threshold (where Y is a float from 0 to 1). If those hits are all to a unique KO profile, annotate the gene with that KO. Long story over, you can set X and Y using the parameters below.

  --skip-bitscore-heuristic
                        If you just want annotations from KOfam hits that are
                        above the KEGG bitscore threshold, use this flag to
                        skip the mumbo-jumbo we do here to relax those
                        thresholds. (default: False)
  -E FLOAT, --heuristic-e-value FLOAT
                        When considering hits that didn't quite make the
                        bitscore cut-off for a gene, we will only look at hits
                        with e-values <= this number. (This is X.) (default:
                        1e-05)
  -H FLOAT, --heuristic-bitscore-fraction FLOAT
                        When considering hits that didn't quite make the
                        bitscore cut-off for a gene, we will only look at hits
                        with bitscores > the KEGG threshold * this number.
                        (This is Y.) It should be a fraction between 0 and 1
                        (inclusive). (default: 0.75)
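
The heuristic above, with X and Y set through -E and -H, can be sketched for a single unannotated gene as follows. This is an illustration of the description, not anvi'o's code: the function name, the input structures, and the KO identifiers in the usage lines are hypothetical.

```python
def relaxed_annotation(hits, thresholds, max_evalue=1e-5, bitscore_fraction=0.75):
    """Return the KO to annotate an unannotated gene with, or None.

    hits: list of (ko, evalue, bitscore) for one gene that received no
          annotation under the regular KEGG thresholds.
    thresholds: dict mapping each KO to its KEGG bitscore threshold.
    """
    candidates = {ko for ko, evalue, bitscore in hits
                  if evalue <= max_evalue                            # this is X
                  and bitscore > bitscore_fraction * thresholds[ko]}  # this is Y
    # Annotate only when every surviving hit points to the same KO.
    return candidates.pop() if len(candidates) == 1 else None


# Hypothetical hits and thresholds, for illustration only.
thresholds = {"K00001": 100.0, "K00002": 200.0}
single = relaxed_annotation([("K00001", 1e-20, 80.0), ("K00001", 1e-10, 78.0)], thresholds)
ambiguous = relaxed_annotation([("K00001", 1e-20, 80.0), ("K00002", 1e-30, 160.0)], thresholds)
```

Here `single` names one KO because both hits clear 75% of that KO's threshold, while `ambiguous` is `None` because the surviving hits point to two different KOs.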

anvi-run-ncbi-cogs

This program runs NCBI's COGs to associate genes in an anvi'o contigs database with functions. The COGs database was designed as an attempt to classify proteins from completely sequenced genomes on the basis of the orthology concept.

Usage

usage: anvi-run-ncbi-cogs [-h] -c CONTIGS_DB [--cog-version COG_VERSION]
                          [--cog-data-dir COG_DATA_DIR] [-T NUM_THREADS]
                          [--sensitive] [--temporary-dir-path PATH]
                          [--search-with SEARCH_METHOD]

Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  --cog-version COG_VERSION
                        COG version. The default is the latest version, which
                        is COG20, meaning that anvi'o will use the NCBI's 2020
                        release of COGs to setup the database and run it on
                        contigs databases. There is also an older version of
                        COGs from 2014. If you would like anvi'o to work with
                        that one, please use COG14 as a parameter. On a single
                        computer you can have both, and on a single contigs
                        database you can run both. Cool and confusing. The
                        anvi'o way. (default: None)
  --cog-data-dir COG_DATA_DIR
                        The directory path for your COG setup. Anvi'o will try
                        to use the default path if you do not specify
                        anything. (default: None)
  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: 1)
  --sensitive           DIAMOND sensitivity. With this flag you can instruct
                        DIAMOND to be 'sensitive', rather than 'fast' during
                        the search. It is likely the search will take
                        remarkably longer. But, hey, if you are doing it for
                        your final analysis, maybe it should take longer and
                        be more accurate. This flag is only relevant if you
                        are running DIAMOND. (default: False)
  --temporary-dir-path PATH
                        If you don't provide anything here, this program will
                        come up with a temporary directory path by itself to
                        store intermediate files, and clean it later. If you
                        want to have full control over this, you can use this
                        flag to define one. (default: None)
  --search-with SEARCH_METHOD
                        What program to use for database searching. The
                        default search uses diamond. All available options
                        include: diamond, blastp. (default: diamond)

anvi-run-pfams

Run Pfam on Contigs Database

Usage

usage: anvi-run-pfams [-h] -c CONTIGS_DB [--pfam-data-dir PFAM_DATA_DIR]
                      [-T NUM_THREADS] [--hmmer-program HMMER_PROGRAM]

Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  --pfam-data-dir PFAM_DATA_DIR
                        The directory path for your Pfam setup. Anvi'o will
                        try to use the default path if you do not specify
                        anything. (default: None)
  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: 1)
  --hmmer-program HMMER_PROGRAM
                        Which of the HMMER programs to use to run HMMs
                        (hmmscan or hmmsearch). By default anvi'o will use
                        hmmscan for typical HMM operations like those in anvi-
                        run-hmms (as these tend to scan a very large number of
                        genes against a relatively small number of HMMs), but
                        if you are using this program to scan a very large
                        number of HMMs, hmmsearch might be a better choice for
                        performance. For this reason, hmmsearch is the default
                        in operations like anvi-run-pfams and anvi-run-kegg-
                        kofams. See this article for a discussion on the
                        performance of these two programs:
                        https://cryptogenomicon.org/2011/05/27/hmmscan-vs-
                        hmmsearch-speed-the-numerology/ (default: None)

anvi-run-scg-taxonomy

The purpose of this program is to affiliate single-copy core genes in an anvi'o contigs database with taxonomic names. A properly set up local SCG taxonomy database is required for this program to perform properly. After a successful run, anvi-estimate-scg-taxonomy can be used to estimate taxonomy at the genome, collection, or metagenome level.

Example uses and other resources

Usage

usage: anvi-run-scg-taxonomy [-h] -c CONTIGS_DB
                             [--scgs-taxonomy-data-dir PATH]
                             [--min-percent-identity PERCENT_IDENTITY]
                             [--max-num-target-sequences NUMBER]
                             [-P NUM_PROCESSES] [-T NUM_THREADS]
                             [--write-buffer-size INT]
                             [--all-hits-output-file FILE_PATH]

Parameters

INPUT DATABASE: An anvi'o contigs database to search for and store the taxonomic affiliations of SCGs.

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)

ADVANCED STUFF:

  --scgs-taxonomy-data-dir PATH
                        The directory for SCGs data to be stored (or read
                        from, depending on the context). If you leave it as is
                        without specifying anything, anvi'o will set up
                        everything in (or try to read things from) a pre-
                        defined default directory. The advantage of using the
                        default directory at the time of set up is that every
                        user of anvi'o on a computer system will be using a
                        single data directory, but then you may need to run
                        the setup program with superuser privileges. If you
                        don't have superuser privileges, then you can use this
                        parameter to tell anvi'o the location you wish to use
                        to setup your databases. If you are using a program
                        (such as `anvi-run-scg-taxonomy` or `anvi-estimate-
                        scg-taxonomy`) you will have to use this parameter to
                        tell those programs where your data are. (default:
                        None)
  --min-percent-identity PERCENT_IDENTITY
                        The default value for this is 90.0%, and in an ideal
                        world you shouldn't really change it. Lowering this
                        value will probably give you too many hits from
                        neighboring genomes, which may ruin your consensus
                        taxonomy (imagine, at 90% identity you may match to a
                        single species, but at 70% identity you may match to
                        every species in a genus and your consensus assignment
                        may be influenced by that). But once in a while you
                        will have a genome that doesn't have any close match
                        in GTDB, and you will be curious to find out what it
                        could be. So, when you are getting no SCG hits
                        whatsoever, only then you may want to play with this
                        value. In those cases you can run anvi-estimate-scg-
                        taxonomy with a `--debug` flag to see what is really
                        going on. We strongly advise you to do this only with
                        single genomes, and never with metagenomes.
  --max-num-target-sequences NUMBER
                        This parameter is used to determine how many hits from
                        the database that have a reasonable match to the query
                        sequence should be taken into consideration to make a
                        final decision about the consensus taxonomy for each
                        individual single-copy core gene sequence. The default
                        is 20, which has been quite reasonable in our tests,
                        however, you may need to increase this number to get
                        more accurate results for your own data. In cases
                        where you think this is what you need, the best way to
                        test the parameter space for `--max-num-target-
                        sequences` is to run the program multiple times on the
                        same database with `--debug` and compare results.
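
The effect of lowering `--min-percent-identity` described above can be illustrated with a toy example. This is only a sketch: the hit list is made up, and the simple majority vote in `toy_consensus` is a hypothetical stand-in, not the consensus algorithm anvi'o actually uses.

```python
from collections import Counter

# Hypothetical hits for one SCG: (species name, percent identity).
# These values are invented for illustration only.
hits = [
    ("Genus_a species_1", 96.5),
    ("Genus_a species_1", 94.0),
    ("Genus_a species_2", 82.0),
    ("Genus_a species_3", 78.5),
    ("Genus_a species_3", 75.0),
    ("Genus_a species_3", 71.0),
]


def toy_consensus(hits, min_percent_identity):
    """Return the most common name among hits at or above the cutoff.

    Returns None when no hit passes the cutoff. This majority vote is an
    assumption made for the example, not anvi'o's real method.
    """
    kept = [name for name, identity in hits if identity >= min_percent_identity]
    if not kept:
        return None
    return Counter(kept).most_common(1)[0][0]


# With the strict cutoff only the close matches vote; with the relaxed
# cutoff, more distant relatives outnumber them and change the outcome.
print(toy_consensus(hits, 90.0))
print(toy_consensus(hits, 70.0))
```

In this made-up case the relaxed cutoff lets more distant hits outvote the close ones, which is the kind of influence the help text warns about.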

PERFORMANCE:

  -P NUM_PROCESSES, --num-parallel-processes NUM_PROCESSES
                        Maximum number of processes to run in parallel. Please
                        note that this is different than number of threads. If
                        you ask for 4 parallel processes, and 5 threads,
                        anvi'o will run four processes in parallel and assign
                        5 threads to each. For resource allocation you must
                        multiply the number of processes and threads.
                        (default: 1)
  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: 1)
  --write-buffer-size INT
                        How many items should be kept in memory before they
                        are written to the disk. The default is 500. The
                        larger the buffer size, the less frequently the
                        program will access the disk, yet the more memory will
                        be consumed since the processed items will be cleared
                        off the memory only after they are written to the
                        disk. The default buffer size will likely work for
                        most cases, but if you feel you need to reduce it, we
                        trust you. Please keep an eye on the memory usage
                        output to make sure the memory use never exceeds the
                        size of the physical memory.
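
The resource arithmetic for `-P` and `-T` described above can be sketched as follows. The function names are hypothetical helpers for this example and are not part of anvi'o.

```python
def total_threads(num_processes, num_threads):
    """Total threads in use: each parallel process gets its own threads."""
    return num_processes * num_threads


def fits_on_machine(num_processes, num_threads, available_cores):
    """True if the requested processes x threads do not exceed the cores."""
    return total_threads(num_processes, num_threads) <= available_cores


# The example from the help text: 4 processes with 5 threads each.
requested = total_threads(4, 5)
print(requested)
print(fits_on_machine(4, 5, available_cores=16))
```

When submitting to a cluster, the product is the number to request from the scheduler, not either value on its own.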

OUTPUT: By default, this program does not generate an output and instead simply stores taxonomy information in the contigs database. But if the user wants more, they get more.

  --all-hits-output-file FILE_PATH
                        If this flag is declared, anvi'o will store a
                        comprehensive list of hits that led to the
                        determination of the consensus hit per sequence (which
                        is the only piece of information that is stored in the
                        contigs database). (default: None)

anvi-run-trna-taxonomy

The purpose of this program is to affiliate tRNA gene sequences in an anvi'o contigs database with taxonomic names. A properly set up local tRNA taxonomy database is required for this program to perform properly. After a successful run, anvi-estimate-trna-taxonomy can be used to estimate taxonomy at the genome, collection, or metagenome level.

Usage

usage: anvi-run-trna-taxonomy [-h] -c CONTIGS_DB
                              [--trna-taxonomy-data-dir PATH]
                              [--min-percent-identity PERCENT_IDENTITY]
                              [--max-num-target-sequences NUMBER]
                              [-P NUM_PROCESSES] [-T NUM_THREADS]
                              [--write-buffer-size INT]
                              [--all-hits-output-file FILE_PATH]

Parameters

INPUT DATABASE: An anvi'o contigs database to search for and store the taxonomic affiliations of tRNA genes.

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)

ADVANCED STUFF:

  --trna-taxonomy-data-dir PATH
                        The directory for tRNA taxonomy data to be stored (or
                        read from, depending on the context). If you leave it
                        as is without specifying anything, anvi'o will set up
                        everything in (or try to read things from) a pre-
                        defined default directory. The advantage of using the
                        default directory at the time of set up is that every
                        user of anvi'o on a computer system will be using a
                        single data directory, but then you may need to run
                        the setup program with superuser privileges. If you
                        don't have superuser privileges, then you can use this
                        parameter to tell anvi'o the location you wish to use
                        to setup your databases. If you are using a program
                        (such as `anvi-run-trna-taxonomy` or `anvi-estimate-
                        trna-taxonomy`) you will have to use this parameter to
                        tell those programs where your data are. (default:
                        None)
  --min-percent-identity PERCENT_IDENTITY
                        The default value for this is 90.0%, and in an ideal
                        world you shouldn't really change it. Lowering this
                        value will probably give you too many hits from
                        neighboring genomes, which may ruin your consensus
                        taxonomy (imagine, at 90% identity you may match to a
                        single species, but at 70% identity you may match to
                        every species in a genus and your consensus assignment
                        may be influenced by that). But once in a while you
                        will have a genome that doesn't have any close match
                        in GTDB, and you will be curious to find out what it
                        could be. So, when you are getting no tRNA hits
                        whatsoever, only then you may want to play with this
                        value. In those cases you can run anvi-estimate-trna-
                        taxonomy with a `--debug` flag to see what is really
                        going on. We strongly advise you to do this only with
                        single genomes, and never with metagenomes.
  --max-num-target-sequences NUMBER
                        This parameter is used to determine how many hits from
                        the database that have a reasonable match to the query
                        sequence should be taken into consideration to make a
                        final decision about the consensus taxonomy for each
                        individual transfer RNA gene sequence. The default is
                        100, which has been quite reasonable in our tests,
                        however, you may need to increase this number to get
                        more accurate results for your own data. In cases
                        where you think this is what you need, the best way to
                        test the parameter space for `--max-num-target-
                        sequences` is to run the program multiple times on the
                        same database with `--debug` and compare results.

PERFORMANCE:

  -P NUM_PROCESSES, --num-parallel-processes NUM_PROCESSES
                        Maximum number of processes to run in parallel. Please
                        note that this is different than number of threads. If
                        you ask for 4 parallel processes, and 5 threads,
                        anvi'o will run four processes in parallel and assign
                        5 threads to each. For resource allocation you must
                        multiply the number of processes and threads.
                        (default: 1)
  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: 1)
  --write-buffer-size INT
                        How many items should be kept in memory before they
                        are written to the disk. The default is 500. The
                        larger the buffer size, the less frequently the
                        program will access the disk, yet the more memory will
                        be consumed since the processed items will be cleared
                        off the memory only after they are written to the
                        disk. The default buffer size will likely work for
                        most cases, but if you feel you need to reduce it, we
                        trust you. Please keep an eye on the memory usage
                        output to make sure the memory use never exceeds the
                        size of the physical memory.
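
The trade-off behind `--write-buffer-size` (fewer disk writes versus more items held in memory) can be sketched with a toy buffer. The `BufferedWriter` class is hypothetical and uses a plain list in place of the disk; it does not reflect anvi'o's internal implementation.

```python
class BufferedWriter:
    """Collect items in memory and write them out in batches."""

    def __init__(self, sink, buffer_size=500):
        self.sink = sink              # a list standing in for the disk
        self.buffer_size = buffer_size
        self.buffer = []
        self.num_writes = 0           # how many times we touched the "disk"

    def add(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        # Items leave memory only once they have been written out.
        if self.buffer:
            self.sink.extend(self.buffer)
            self.num_writes += 1
            self.buffer.clear()


def count_writes(num_items, buffer_size):
    """Number of write operations needed to store num_items."""
    writer = BufferedWriter(sink=[], buffer_size=buffer_size)
    for i in range(num_items):
        writer.add(i)
    writer.flush()  # write whatever is left at the end
    return writer.num_writes


print(count_writes(1200, buffer_size=500))
print(count_writes(1200, buffer_size=100))
```

A larger buffer lowers the number of writes for the same data, while the peak number of items held in memory grows with the buffer size.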

OUTPUT: By default, this program does not generate an output and instead simply stores taxonomy information in the contigs database. But if the user wants more, they get more.

  --all-hits-output-file FILE_PATH
                        If this flag is declared, anvi'o will store a
                        comprehensive list of hits that led to the
                        determination of the consensus hit per sequence (which
                        is the only piece of information that is stored in the
                        contigs database). (default: None)

anvi-run-workflow

Execute, manage, parallelize, and troubleshoot entire 'omics workflows and chain together anvi'o and third party programs

metagenomics phylogenomics contigs pangenomics

Example uses and other resources

Usage

usage: anvi-run-workflow [-h] [-w WORKFLOW]
                         [--get-default-config OUTPUT_FILENAME]
                         [--list-workflows] [--list-dependencies]
                         [-c CONFIG_FILE] [--dry-run] [--skip-dry-run]
                         [--save-workflow-graph] [-A ...]

Parameters

ESSENTIAL INPUTS: Things you must provide or this won't work

  -w WORKFLOW, --workflow WORKFLOW
                        You must specify a workflow name. To see a list of
                        available workflows run --list-workflows. (default:
                        None)

ADDITIONAL STUFF: additional stuff

  --get-default-config OUTPUT_FILENAME
                        Store a JSON-formatted config file with all the
                        default settings of the workflow. This is a good draft
                        you could use in order to write your own config file.
                        This config file contains all parameters that could be
                        configured for this workflow. NOTICE: the config file
                        is provided with default values only for parameters
                        that are set by us in the workflow. The values for the
                        rest of the parameters are determined by the relevant
                        program. (default: None)
  --list-workflows      Print a list of available snakemake workflows
                        (default: False)
  --list-dependencies   Print a list of the dependencies of this workflow. You
                        must provide a workflow name and a config file.
                        snakemake will figure out which rules need to be run
                        according to your config file, and according to the
                        files available on your disk. According to the rules
                        that need to be run, we will let you know which
                        programs are going to be used, so that you can make
                        sure you have all of them installed and loaded.
                        (default: False)
  -c CONFIG_FILE, --config-file CONFIG_FILE
                        A JSON-formatted configuration file. (default: None)
  --dry-run             Don't do anything real. Test everything, and stop
                        right before wherever the developer said 'well, this
                        is enough testing', and decided to print out results.
                        (default: False)
  --skip-dry-run        Don't do a dry run. Just start the workflow! Useful
                        when your job is so big it takes hours to do a dry
                        run. (default: False)
  --save-workflow-graph
                        Save a graph representation of the workflow. If you
                        are using this flag and if your system is unable to
                        generate such graph outputs, you will hear anvi'o
                        complaining (still, totally worth trying). (default:
                        False)
  -A ..., --additional-params ...
                        Additional snakemake parameters to add when running
                        snakemake. NOTICE: --additional-params HAS TO BE THE
                        LAST ARGUMENT THAT IS PASSED TO anvi-run-workflow,
                        ANYTHING THAT FOLLOWS WILL BE CONSIDERED AS PART OF
                        THE ADDITIONAL PARAMETERS THAT ARE PASSED TO
                        SNAKEMAKE. Any parameter that is accepted by snakemake
                        should be fair game here, but it is your
                        responsibility to make sure that whatever you added
                        makes sense. To see what parameters are available
                        please refer to the snakemake documentation. For
                        example, you could use this to set up cluster
                        submission using --additional-params --cluster 'YOUR-
                        CLUSTER-SUBMISSION-CMD'. (default: None)
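
Because `--additional-params` must come last, it can help to assemble the command programmatically. The sketch below only builds and prints the argument list; it does not run anything. The helper name, the workflow name `metagenomics`, and the file name `config.json` are assumptions for illustration (run `--list-workflows` to see the real workflow names).

```python
import shlex


def build_workflow_command(workflow, config_file, additional_params=None):
    """Build an anvi-run-workflow argument list.

    Everything in additional_params is placed after --additional-params,
    at the very end, since anything following that flag is handed to
    snakemake.
    """
    cmd = ["anvi-run-workflow", "-w", workflow, "-c", config_file]
    if additional_params:
        cmd.append("--additional-params")
        cmd.extend(additional_params)
    return cmd


cmd = build_workflow_command(
    "metagenomics",                      # assumed workflow name
    "config.json",                       # hypothetical config file
    ["--cluster", "YOUR-CLUSTER-SUBMISSION-CMD"],
)
print(shlex.join(cmd))
```

Keeping the snakemake parameters in their own list makes it hard to accidentally place an anvi'o flag after `--additional-params`.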

anvi-scan-trnas

Identify and store tRNA genes in a contigs database

Usage

usage: anvi-scan-trnas [-h] -c CONTIGS_DB [-T NUM_THREADS]
                       [--log-file FILE_PATH] [--trna-hits-file FILE_PATH]
                       [--trna-cutoff-score INT] [--just-do-it]

Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: 1)
  --log-file FILE_PATH  File path to store debug/output messages. (default:
                        None)
  --trna-hits-file FILE_PATH
                        File path to store raw hits from tRNA scan. (default:
                        None)
  --trna-cutoff-score INT
                        Minimum score to assume a hit comes from a proper tRNA
                        gene (passed to tRNAscan-SE). The default is 20. It
                        can take any value between 0 and 100.
  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)

anvi-search-functions

Search functions in an anvi'o contigs database or genomes storage. Basically, this program searches for one or more search terms you define in functional annotations of genes in an anvi'o contigs database, and generates multiple reports. The default report simply tells you which contigs contain genes with functions matching the search terms you used, which is useful for viewing in the interface. You can also request a much more comprehensive report, which gives you anything you might need to know for each hit and search term.

Usage

usage: anvi-search-functions [-h] [-c CONTIGS_DB] [-p PAN_DB]
                             [-g GENOMES_STORAGE] --search-terms SEARCH_TERMS
                             [--delimiter CHAR]
                             [--annotation-sources SOURCE NAME[S]] [-l]
                             [-o FILE_PATH] [--full-report FILE_NAME]
                             [--include-sequences] [--verbose]

Parameters

SEARCH IN: Relevant source databases

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -p PAN_DB, --pan-db PAN_DB
                        Anvi'o pan database (default: None)
  -g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
                        Anvi'o genomes storage file (default: None)

SEARCH FOR: Relevant terms

  --search-terms SEARCH_TERMS
                        Search terms. Multiple of them can be declared
                        separated by a delimiter (the default is a comma).
                        (default: None)
  --delimiter CHAR      The delimiter to parse multiple input terms. The
                        default is ','.
  --annotation-sources SOURCE NAME[S]
                        Get functional annotations for a specific list of
                        annotation sources. You can specify one or more
                        sources by separating them from each other with a
                        comma character (i.e., '--annotation-sources
                        source_1,source_2,source_3'). The default behavior is
                        to return everything (default: None)
  -l, --list-annotation-sources
                        List available functional annotation sources.
                        (default: False)

REPORT: Anvi'o can report the hits in multiple ways. The output file will be a very simple 2-column TAB-delimited output that is compatible with the anvi'o additional data format (so you can give it to anvi-interactive to see which splits contained genes that matched your search terms). You can also ask anvi'o to generate a full report that contains much more, and much more helpful, information about each hit. Optionally, you can even ask for the gene sequences to appear in this report.

  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
  --full-report FILE_NAME
                        Optional output file with a fuller description of
                        findings. (default: None)
  --include-sequences   Include sequences in the report. (default: False)
  --verbose             Be verbose, print more messages whenever possible. You
                        may regret this. (default: False)
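
How `--search-terms` and `--delimiter` interact, and what a 2-column TAB-delimited report could look like, can be sketched as follows. The annotations are invented, and the case-insensitive substring matching and whitespace stripping are assumptions for this example rather than a description of anvi'o's exact matching rules.

```python
def parse_search_terms(search_terms, delimiter=","):
    """Split the raw --search-terms string into individual terms."""
    return [t.strip() for t in search_terms.split(delimiter) if t.strip()]


def search_functions(annotations, search_terms, delimiter=","):
    """Return (split_name, matching_term) rows.

    annotations is a list of (split_name, function_text) pairs.
    Matching here is a case-insensitive substring test (an assumption).
    """
    terms = parse_search_terms(search_terms, delimiter)
    rows = []
    for split_name, function_text in annotations:
        for term in terms:
            if term.lower() in function_text.lower():
                rows.append((split_name, term))
    return rows


# Hypothetical annotations for illustration.
annotations = [
    ("split_001", "DNA polymerase III subunit alpha"),
    ("split_002", "ABC transporter permease"),
    ("split_003", "RNA polymerase sigma factor"),
]

for row in search_functions(annotations, "polymerase;transporter", delimiter=";"):
    print("\t".join(row))
```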

anvi-search-palindromes

A program to find palindromes in sequences

Usage

usage: anvi-search-palindromes [-h] [-c CONTIGS_DB] [-f FASTA file]
                               [--dna-sequence DNA SEQ] [-l INT] [-m INT]
                               [-d INT] [--blast-word-size INT]
                               [-T NUM_THREADS] [-o FILE_PATH] [--verbose]

Parameters

SEQUENCE SOURCE: Where should anvi'o find your sequences?

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -f FASTA file, --fasta-file FASTA file
                        A FASTA-formatted input file. (default: None)
  --dna-sequence DNA SEQ
                        Literally a DNA sequence. For the very lazy. (default:
                        None)

PALINDROME BUT HOW?: Some essential stuff here about your palindromes. Please also see the search sensitivity section below.

  -l INT, --min-palindrome-length INT
                        The minimum palindrome length. (default: 10)
  -m INT, --max-num-mismatches INT
                        The maximum number of mismatches allowed. (default: 0)
  -d INT, --min-distance INT
                        The minimum distance between the palindromic sequences
                        (this parameter is essentially asking for the number
                        of `x` in the sequence `ATCGxxxCGAT`). The default is
                        50, which means the algorithm will never report by
                        default sequences that are like `ATCGCGAT` with no
                        gaps between the palindrome where the palindromic
                        sequence matches itself (but you can get such
                        palindromes by setting this parameter to 0). (default:
                        50)
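
The roles of `--min-palindrome-length` and `--min-distance` can be illustrated with a naive search. This is a minimal sketch that assumes perfect matches only and reports arms of exactly the minimum length; it is not the BLAST-based approach the program itself uses, and `find_palindromes` is a hypothetical helper.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_palindromes(sequence, min_palindrome_length=10, min_distance=50):
    """Return (first_start, second_start, distance) tuples.

    An arm of min_palindrome_length nucleotides is paired with a later
    occurrence of its reverse complement, provided at least min_distance
    nucleotides separate the two arms. Only perfect matches are found.
    """
    n = min_palindrome_length
    hits = []
    for i in range(len(sequence) - n + 1):
        target = reverse_complement(sequence[i:i + n])
        j = sequence.find(target, i + n + min_distance)
        while j != -1:
            hits.append((i, j, j - (i + n)))
            j = sequence.find(target, j + 1)
    return hits


# `ATCGxxxCGAT` from the help text, with three nucleotides in between.
print(find_palindromes("ATCGTTTCGAT", min_palindrome_length=4, min_distance=3))
# `ATCGCGAT` has no gap, so it is reported only when the distance is 0.
print(find_palindromes("ATCGCGAT", min_palindrome_length=4, min_distance=0))
print(find_palindromes("ATCGCGAT", min_palindrome_length=4))
```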

SEARCH SENSITIVITY & PERFORMANCE: Yes.

  --blast-word-size INT
                        This parameter is passed to blastn as the
                        `-word_size`, which in the BLAST world means the
                        length of the best perfect match among your alignments. The
                        shorter the word size, the more short palindromes you
                        will find, but it will also influence your ability to
                        find palindromes with mismatches. For instance, if the
                        word size is 10, then you will not find a palindrome
                        that is 18 nt long but has a mismatch right at the
                        9th nucleotide (because individual perfect matches in
                        the alignment will have word sizes of less than 10).
                        So if you want to search for palindromes with a lot of
                        mismatches, then you would like to keep your word size
                        small. But smaller word sizes will impact your
                        performance negatively, and very small ones will
                        simply make it impossible to finish running for
                        especially long contigs. If you want to find
                        palindromes with no mismatches, then you can safely
                        match the word size to the minimum palindrome length.
                        If you want to do some test run, take a DNA sequence
                        (say about 1,000 nts) that contains a palindrome that
                        looks like the kinds of palindromes you will be
                        interested in finding, and run the program `anvi-
                        search-palindromes` with the `--dna-sequence` parameter
                        and the `--verbose` flag. As you play with the word
                        size, maximum number of mismatches, and the minimum
                        palindrome length, the output messages will help you
                        determine your best parameters. (default: 10)
  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: 1)
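
The word size example above (an 18 nt palindrome with a mismatch at the 9th nucleotide) can be checked with a few lines of arithmetic. This sketch assumes a simplified rule, namely that a hit requires at least one perfect run at least as long as the word size; the function names are hypothetical.

```python
def perfect_run_lengths(palindrome_length, mismatch_positions):
    """Lengths of the perfect stretches between mismatches.

    mismatch_positions are 1-based positions within the palindrome.
    """
    mismatches = set(mismatch_positions)
    runs = []
    run = 0
    for pos in range(1, palindrome_length + 1):
        if pos in mismatches:
            if run:
                runs.append(run)
            run = 0
        else:
            run += 1
    if run:
        runs.append(run)
    return runs


def is_detectable(palindrome_length, mismatch_positions, word_size):
    """True if some perfect run is at least word_size long (simplified)."""
    runs = perfect_run_lengths(palindrome_length, mismatch_positions)
    return max(runs, default=0) >= word_size


# 18 nt palindrome, mismatch at position 9: runs of 8 and 9 nucleotides.
print(perfect_run_lengths(18, [9]))
print(is_detectable(18, [9], word_size=10))
print(is_detectable(18, [9], word_size=9))
```

Under this simplification, the longest perfect run in the palindromes you care about gives an upper bound for a usable word size.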

OUTPUT: Output options.

  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
  --verbose             Be verbose, print more messages whenever possible. You
                        may regret this. (default: False)

anvi-search-sequence-motifs

A program to find one or more sequence motifs in contig or gene sequences, and store their frequencies

Usage

usage: anvi-search-sequence-motifs [-h] -c CONTIGS_DB [-p PROFILE_DB]
                                   [--genes-db GENES_DB] --motifs MOTIFS
                                   [-o FILE_PATH] [--store-in-db]

Parameters

SEQUENCES: A contigs database, essentially, to search for motifs.

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)

OPTIONAL DBs: This program can store the frequencies of your motifs into profile or gene databases. See the online documentation at the end of the help menu for details.

  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  --genes-db GENES_DB   Anvi'o genes database (default: None)

MOTIFS: Sequences to search for.

  --motifs MOTIFS       The motif sequence. You can search for more than one,
                        in which case you should use a comma (',') to separate
                        them from each other. (default: None)

OUTPUT: Output options. The output file is the obvious option. But if you provide a profile or genes database AND use the flag --store-in-db, then anvi'o will also store the motif frequencies in your databases as items additional data.

  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
  --store-in-db         Store analysis results into the database directly.
                        (default: False)
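
As a rough illustration of what "motif frequencies" means here, the sketch below counts occurrences of comma-separated motifs in two made-up contig sequences. It is not anvi'o's implementation: the function names, the toy sequences, and the choice to count overlapping matches on the forward strand only are all assumptions made for this example.

```python
def parse_motifs(motifs_arg):
    """Split a comma-separated string, as passed to --motifs, into a list."""
    return [m.strip().upper() for m in motifs_arg.split(",") if m.strip()]


def count_motif(sequence, motif):
    """Count occurrences of motif in sequence, including overlapping ones."""
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        # Move one position forward so overlapping matches are also found.
        start = sequence.find(motif, start + 1)
    return count


def motif_frequencies(sequences, motifs_arg):
    """Return {sequence_name: {motif: count}} for every sequence and motif."""
    motifs = parse_motifs(motifs_arg)
    return {
        name: {motif: count_motif(seq.upper(), motif) for motif in motifs}
        for name, seq in sequences.items()
    }


# Toy contigs (hypothetical data, for illustration only).
contigs = {
    "contig_1": "ATCGATCGTTATAT",
    "contig_2": "GGGGATATATCC",
}

frequencies = motif_frequencies(contigs, "ATCG,ATAT")
for name, counts in frequencies.items():
    print(name, counts)
```

Each inner dictionary corresponds to one row of a per-sequence frequency table, with one column per motif.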

anvi-self-test

A script for anvi'o to test itself

Usage

usage: anvi-self-test [-h] [--suite SUITE] [-o DIR_PATH] [--no-interactive]

Parameters

optional arguments:

  --suite SUITE         A suite of component tests to execute. By default this
                        program will execute the mini test of anvi'o, which
                        will help you to see if your computer and installation
                        is able to perform some of the most basic anvi'o
                        operations, such as generating an anvi'o contigs
                        database, profiling BAM files, or starting an
                        interactive interface. But you are welcome to execute
                        different component tests. Here is a list of what is
                        available to you: 'mini', 'metagenomics-full',
                        'pangenomics', 'interactive-interface', 'metabolism',
                        'display-functions', 'trnaseq', 'workflow-contigs',
                        'workflow-metagenomics', 'workflow-pangenomics',
                        'workflow-phylogenomics' (default: mini)
  -o DIR_PATH, --output-dir DIR_PATH
                        If you declare an output dir, all your data will be
                        stored in there, instead of being stored in a
                        temporary directory to be deleted once the tests are
                        done. This is particularly useful if you wish to play
                        with general anvi'o output files (default: None)
  --no-interactive      Don't show anything interactive (if possible).
                        (default: False)
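
The following sketch assembles an anvi-self-test command line from the options described above and rejects suite names that are not in the list. The helper name `self_test_command` and the directory name `test-output` are hypothetical; the sketch only prints the command and does not run anvi'o.

```python
import shlex

# Suite names as listed in the help text above.
KNOWN_SUITES = (
    "mini", "metagenomics-full", "pangenomics", "interactive-interface",
    "metabolism", "display-functions", "trnaseq", "workflow-contigs",
    "workflow-metagenomics", "workflow-pangenomics", "workflow-phylogenomics",
)


def self_test_command(suite="mini", output_dir=None, no_interactive=False):
    """Build an anvi-self-test argument list without running anything."""
    if suite not in KNOWN_SUITES:
        raise ValueError(f"Unknown suite: {suite!r}")
    args = ["anvi-self-test", "--suite", suite]
    if output_dir is not None:
        args += ["--output-dir", output_dir]
    if no_interactive:
        args.append("--no-interactive")
    return args


# 'test-output' is a hypothetical directory name.
command = self_test_command("pangenomics", output_dir="test-output",
                            no_interactive=True)
print(shlex.join(command))
```

Keeping the arguments as a list means the same value could later be handed to `subprocess.run` without any shell quoting concerns.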

anvi-setup-interacdome

Setup InteracDome data

Usage

usage: anvi-setup-interacdome [-h] [--interacdome-data-dir PATH] [--reset]

Parameters

optional arguments:

  --interacdome-data-dir PATH
                        The path for the interacdome data to be stored. If you
                        leave it as is without specifying anything, anvi'o
                        will set up everything in a pre-defined default
                        directory. The advantage of using the default
                        directory at the time of set up is that every user of
                        anvi'o on a computer system will be using a single
                        data directory, but then you may need to run the setup
                        program with superuser privileges. If you don't have
                        superuser privileges, then you can use this parameter
                        to tell anvi'o the location you wish to use to setup
                        your data. (default: None)
  --reset               Remove all the previously stored files and start over.
                        If something feels wrong for some reason and if you
                        believe re-downloading files and setting them up could
                        address the issue, this is the flag that will tell
                        anvi'o to act like a real computer scientist
                        challenged with a computational problem. (default:
                        False)

anvi-setup-kegg-kofams

Download and setup KEGG KOfam HMM profiles and KEGG MODULE data

Usage

usage: anvi-setup-kegg-kofams [-h] [--kegg-data-dir KEGG_DATA_DIR]
                              [--kegg-archive KEGG_ARCHIVE] [-D]
                              [--kegg-snapshot RELEASE_NUM] [--reset]
                              [--just-do-it]

Parameters

POSSIBLE INPUT: Not required for this program to run, but could be useful. Note that if you provide no parameters, this program will download the frozen snapshot of the KEGG databases that is associated with the latest release of anvi'o.

  --kegg-data-dir KEGG_DATA_DIR
                        The directory path for your KEGG setup, which will
                        include things like KOfam profiles and KEGG MODULE
                        data. Anvi'o will try to use the default path if you
                        do not specify anything. (default: None)
  --kegg-archive KEGG_ARCHIVE
                        The path to an archived (.tar.gz) KEGG directory
                        (which you have downloaded from figshare or from a
                        collaborator who has a KEGG data directory generated
                        by anvi'o). If you provide this parameter, anvi'o will
                        set up the KEGG data directory from the archive you
                        specify rather than downloading and setting up our
                        default KEGG archive. (default: None)
  -D, --download-from-kegg
                        This flag is for those people who always need the
                        latest data. You know who you are :) By default, this
                        program will set up a snapshot of the KEGG databases,
                        which will be dated to the time of the anvi'o release
                        that you are currently working with. The pros of this
                        are that the KEGG data will be the same for everyone
                        (which makes sharing your KEGG-annotated datasets
                        easy), and you will not have to worry about updating
                        your datasets with new annotations every time that
                        KEGG updates. However, KEGG updates regularly, so the
                        con of this is that you will not have the most up-to-
                        date version of KEGG for your annotations, metabolism
                        estimations, or any other downstream uses of this
                        data. If that is going to be a problem for you, do not
                        fear - you can provide this flag to tell anvi'o to
                        download the latest, freshest data directly from
                        KEGG's REST API and set it up into an
                        anvi'o-compatible database. (default: False)
  --kegg-snapshot RELEASE_NUM
                        If you are particularly interested in an earlier
                        snapshot of KEGG that anvi'o knows about, you can set
                        it here. Otherwise anvi'o will always use the latest
                        snapshot it knows about, which is likely to be the one
                        associated with the current release of anvi'o.
                        (default: None)
  --reset               Remove all the previously stored files and start over.
                        If something feels wrong for some reason and if you
                        believe re-downloading files and setting them up could
                        address the issue, this is the flag that will tell
                        anvi'o to act like a real computer scientist
                        challenged with a computational problem. (default:
                        False)
  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)
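
The help text above describes three ways to obtain the KEGG data: the default frozen snapshot, a local archive, or a fresh download from KEGG. The sketch below builds the matching command line for each case. The helper name and the file paths are hypothetical, and treating the three sources as mutually exclusive is an assumption of this sketch rather than something the help text states. Nothing is executed.

```python
import shlex


def kegg_setup_command(archive=None, download_from_kegg=False,
                       snapshot=None, data_dir=None):
    """Build an anvi-setup-kegg-kofams argument list (nothing is executed)."""
    # Assumption for this sketch: the three data sources are alternatives,
    # so at most one of them may be requested at a time.
    chosen = [archive is not None, download_from_kegg, snapshot is not None]
    if sum(chosen) > 1:
        raise ValueError(
            "Choose only one of: archive, download_from_kegg, snapshot"
        )

    args = ["anvi-setup-kegg-kofams"]
    if data_dir is not None:
        args += ["--kegg-data-dir", data_dir]
    if archive is not None:
        args += ["--kegg-archive", archive]
    elif download_from_kegg:
        args.append("--download-from-kegg")
    elif snapshot is not None:
        args += ["--kegg-snapshot", snapshot]
    return args


# With no arguments, the default frozen snapshot is used.
print(shlex.join(kegg_setup_command()))
# Hypothetical paths, for illustration only.
print(shlex.join(kegg_setup_command(archive="KEGG_archive.tar.gz",
                                    data_dir="my-kegg")))
```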

anvi-setup-ncbi-cogs

Download and setup NCBI's Clusters of Orthologous Groups database

Usage

usage: anvi-setup-ncbi-cogs [-h] [--cog-version COG_VERSION]
                            [--cog-data-dir COG_DATA_DIR] [--reset]
                            [--just-do-it] [-T NUM_THREADS]

Parameters

optional arguments:

  --cog-version COG_VERSION
                        COG version. The default is the latest version, which
                        is COG20, meaning that anvi'o will use the NCBI's 2020
                        release of COGs to setup the database and run it on
                        contigs databases. There is also an older version of
                        COGs from 2014. If you would like anvi'o to work with
                        that one, please use COG14 as a parameter. On a single
                        computer you can have both, and on a single contigs
                        database you can run both. Cool and confusing. The
                        anvi'o way. (default: None)
  --cog-data-dir COG_DATA_DIR
                        The directory for COG data to be stored. If you leave
                        it as is without specifying anything, the default
                        destination for the data directory will be used to set
                        things up. The advantage of it is that everyone will
                        be using a single data directory, but then you may
                        need superuser privileges to do it. Using this
                        parameter you can choose the location of the data
                        directory somewhere you like. However, when it is time
                        to run COGs, you will need to remember that path and
                        provide it to the program. (default: None)
  --reset               Remove all the previously stored files and start over.
                        If something feels wrong for some reason and if you
                        believe re-downloading files and setting them up could
                        address the issue, this is the flag that will tell
                        anvi'o to act like a real computer scientist
                        challenged with a computational problem. (default:
                        False)
  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)
  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: 1)
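
The advice above about not exceeding the number of CPUs can be applied before choosing a value for -T. The sketch below clamps a requested thread count to the number of CPUs Python reports; the helper name `safe_num_threads` is hypothetical. On a shared cluster the scheduler may grant fewer cores than the machine has, so the `available` argument lets you pass your actual allocation instead.

```python
import os


def safe_num_threads(requested, available=None):
    """Clamp a requested thread count to the range [1, available CPUs]."""
    if available is None:
        # os.cpu_count() can return None when the count is undetermined.
        available = os.cpu_count() or 1
    return max(1, min(requested, available))


# Ask for 8 threads, but never more than this machine reports.
threads = safe_num_threads(8)
print(f"-T {threads}")
```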

anvi-setup-pdb-database

Setup or update an offline database of representative PDB structures clustered at 95%

Usage

usage: anvi-setup-pdb-database [-h] [--pdb-database-path PATH]
                               [-T NUM_THREADS] [--update]
                               [--skip-modeller-update] [--reset]

Parameters

optional arguments:

  --pdb-database-path PATH
                        The path for the PDB database to be stored. If you
                        leave it as is without specifying anything, anvi'o
                        will set up everything in a pre-defined default
                        directory. The advantage of using the default
                        directory at the time of set up is that every user of
                        anvi'o on a computer system will be using a single
                        data directory, but then you may need to run the setup
                        program with superuser privileges. If you don't have
                        superuser privileges, then you can use this parameter
                        to tell anvi'o the location you wish to use to setup
                        your database. (default: None)
  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: 1)
  --update              Use this flag if you would like to update your current
                        database. (default: False)
  --skip-modeller-update
                        By default, MODELLER's search DB is updated when this
                        program is run so that if MODELLER finds a protein,
                        its structure is guaranteed to be in this database. If
                        you don't want to touch the MODELLER database, use
                        this flag. (default: False)
  --reset               Remove all the previously stored files and start over.
                        If something feels wrong for some reason and if you
                        believe re-downloading files and setting them up could
                        address the issue, this is the flag that will tell
                        anvi'o to act like a real computer scientist
                        challenged with a computational problem. (default:
                        False)

anvi-setup-pfams

Download and setup Pfam data from the EBI

Usage

usage: anvi-setup-pfams [-h] [--pfam-data-dir PFAM_DATA_DIR] [--reset]
                        [--pfam-version PFAM_VERSION]

Parameters

optional arguments:

  --pfam-data-dir PFAM_DATA_DIR
                        The directory for Pfam data to be stored. If you leave
                        it as is without specifying anything, the default
                        destination for the data directory will be used to set
                        things up. The advantage of it is that everyone will
                        be using a single data directory, but then you may
                        need superuser privileges to do it. Using this
                        parameter you can choose the location of the data
                        directory somewhere you like. However, when it is time
                        to run Pfam, you will need to remember that path and
                        provide it to the program. (default: None)
  --reset               This program by default attempts to use previously
                        downloaded files in your Pfam data directory if there
                        are any. If something is wrong for some reason you can
                        use this to tell anvi'o to remove everything, and
                        start over. (default: False)
  --pfam-version PFAM_VERSION
                        By default, the most current version available will be
                        downloaded. If you have specific tastes for a
                        different version, you can provide it here. For
                        example, `31.0`. Here are all possible versions:
                        ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/
                        (default: None)

anvi-setup-scg-taxonomy

The purpose of this program is to download necessary information from GTDB (https://gtdb.ecogenomic.org/), and set it up in such a way that your anvi'o installation is able to assign taxonomy to single-copy core genes using anvi-run-scg-taxonomy and estimate taxonomy for genomes or metagenomes using anvi-estimate-scg-taxonomy

Usage

usage: anvi-setup-scg-taxonomy [-h] [-T NUM_THREADS]
                               [--scgs-taxonomy-data-dir PATH]
                               [--gtdb-release RELEASE_NUM] [--reset]
                               [--redo-databases]

Parameters

optional arguments:

  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: 1)
  --scgs-taxonomy-data-dir PATH
                        The directory for SCGs data to be stored (or read
                        from, depending on the context). If you leave it as is
                        without specifying anything, anvi'o will set up
                        everything in (or try to read things from) a pre-
                        defined default directory. The advantage of using the
                        default directory at the time of set up is that every
                        user of anvi'o on a computer system will be using a
                        single data directory, but then you may need to run
                        the setup program with superuser privileges. If you
                        don't have superuser privileges, then you can use this
                        parameter to tell anvi'o the location you wish to use
                        to setup your databases. If you are using a program
                        (such as `anvi-run-scg-taxonomy` or `anvi-estimate-
                        scg-taxonomy`) you will have to use this parameter to
                        tell those programs where your data are. (default:
                        None)
  --gtdb-release RELEASE_NUM
                        If you are particularly interested in an earlier
                        release anvi'o knows about, you can set it here.
                        Otherwise anvi'o will always use the latest release it
                        knows about. (default: None)
  --reset               Remove all the previously stored files and start over.
                        If something feels wrong for some reason and if you
                        believe re-downloading files and setting them up could
                        address the issue, this is the flag that will tell
                        anvi'o to act like a real computer scientist
                        challenged with a computational problem. (default:
                        False)
  --redo-databases      Remove existing databases and re-create them. This can
                        be necessary when versions of programs change and
                        databases they create and use become incompatible.
                        (default: False)

anvi-setup-trna-taxonomy

The purpose of this program is to set up the necessary databases for tRNA genes collected from GTDB (https://gtdb.ecogenomic.org/) genomes in your local anvi'o installation, so that taxonomy information for a given set of tRNA sequences can be identified using anvi-run-trna-taxonomy and made sense of via anvi-estimate-trna-taxonomy

Usage

usage: anvi-setup-trna-taxonomy [-h] [-T NUM_THREADS]
                                [--trna-taxonomy-data-dir PATH] [--reset]
                                [--redo-databases]

Parameters

optional arguments:

  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: 1)
  --trna-taxonomy-data-dir PATH
                        The directory for tRNA taxonomy data to be stored (or
                        read from, depending on the context). If you leave it
                        as is without specifying anything, anvi'o will set up
                        everything in (or try to read things from) a pre-
                        defined default directory. The advantage of using the
                        default directory at the time of set up is that every
                        user of anvi'o on a computer system will be using a
                        single data directory, but then you may need to run
                        the setup program with superuser privileges. If you
                        don't have superuser privileges, then you can use this
                        parameter to tell anvi'o the location you wish to use
                        to setup your databases. If you are using a program
                        (such as `anvi-run-trna-taxonomy` or `anvi-estimate-
                        trna-taxonomy`) you will have to use this parameter to
                        tell those programs where your data are. (default:
                        None)
  --reset               Remove all the previously stored files and start over.
                        If something feels wrong for some reason and if you
                        believe re-downloading files and setting them up could
                        address the issue, this is the flag that will tell
                        anvi'o to act like a real computer scientist
                        challenged with a computational problem. (default:
                        False)
  --redo-databases      Remove existing databases and re-create them. This can
                        be necessary when versions of programs change and
                        databases they create and use become incompatible.
                        (default: False)

anvi-show-collections-and-bins

A script to display collections stored in an anvi'o profile or pan database

Usage

usage: anvi-show-collections-and-bins [-h] -p PAN_OR_PROFILE_DB

Parameters

optional arguments:

  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
                        Anvi'o pan or profile database (and even genes
                        database in appropriate contexts). (default: None)

anvi-show-misc-data

Show all misc data keys in all misc data tables

Usage

usage: anvi-show-misc-data [-h] [-p PAN_OR_PROFILE_DB] [-c CONTIGS_DB]
                           [-t NAME] [-D NAME]

Parameters

Database input: Provide 1 of these

  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
                        Anvi'o pan or profile database (and even genes
                        database in appropriate contexts). (default: None)
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)

Details: Everything else.

  -t NAME, --target-data-table NAME
                        The target table is the table you are interested in
                        accessing. Currently it can be 'items', 'layers', or
                        'layer_orders'. Please see the most up-to-date online
                        documentation for more information. (default: None)
  -D NAME, --target-data-group NAME
                        Data group to focus on. Anvi'o misc data tables support
                        associating a set of data keys with a data group. If
                        you have no idea what this is, then probably you don't
                        need it, and anvi'o will take care of you. Note: this
                        flag is IRRELEVANT if you are working with additional
                        order data tables. (default: None)

anvi-split

Split an anvi'o pan or profile database into smaller, self-contained pieces. Provide either a genomes-storage and pan database or a profile and contigs database pair, and you'll get back directories of individual projects for each bin that can be treated as smaller anvi'o projects

Usage

usage: anvi-split [-h] -p PAN_OR_PROFILE_DB [-c CONTIGS_DB]
                  [-g GENOMES_STORAGE] [--skip-variability-tables]
                  [--compress-auxiliary-data] [-C COLLECTION_NAME]
                  [-b BIN_NAME] [-o DIR_PATH] [--list-collections]
                  [--skip-hierarchical-clustering]
                  [--enforce-hierarchical-clustering]
                  [--distance DISTANCE_METRIC] [--linkage LINKAGE_METHOD]

Parameters

DATABASES: You will either provide a PROFILE/CONTIGS or a PAN/GENOMES STORAGE pair here.

  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
                        Anvi'o pan or profile database (and even genes
                        database in appropriate contexts). (default: None)
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
                        Anvi'o genomes storage file (default: None)

PROFILE/CONTIGS OPTIONS: Some options that are specific to this only.

  --skip-variability-tables
                        Processing variability tables in profile database
                        might take a very long time. With this flag you will
                        be asking anvi'o to skip them. (default: False)
  --compress-auxiliary-data
                        When declared, the auxiliary data file in the
                        resulting output will be compressed. This saves space,
                        but it takes a long time. Also, if you are planning to
                        compress the entire output later using GZIP, there is
                        no point in doing this. But you are the boss!
                        (default: False)

COLLECTION: You should provide a valid collection name. If you do not provide bin names, the program will generate an output for each bin in your collection separately.

  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name. (default: None)
  -b BIN_NAME, --bin-id BIN_NAME
                        Bin name you are interested in. (default: None)

OUTPUT: Where do we want the resulting split profiles to be stored.

  -o DIR_PATH, --output-dir DIR_PATH
                        Directory path for output files (default: None)

EXTRAS: Stuff that you rarely need, but you really really need when the time comes. The following parameters will apply to each of the resulting anvi'o profiles that will be split from the mother anvi'o profile.

  --list-collections    Show available collections and exit. (default: False)
  --skip-hierarchical-clustering
                        If you are not planning to use the interactive
                        interface (or if you have other means to add a tree of
                        contigs in the database) you may skip the step where
                        hierarchical clustering of your items is performed
                        based on default clustering recipes matching to your
                        database type. (default: False)
  --enforce-hierarchical-clustering
                        If you have more than 25,000 splits in your merged
                        profile, anvi-merge will automatically skip the
                        hierarchical clustering of splits (by setting --skip-
                        hierarchical-clustering flag on). This is due to the
                        fact that computational time required for hierarchical
                        clustering increases very rapidly with the number of
                        items being clustered. Based on our experience we
                        decided that 25,000 splits is about the maximum we
                        should try. However, this is not a theoretical limit,
                        and you can override this heuristic by using this
                        flag, which would tell anvi'o to attempt to cluster
                        splits regardless. (default: False)
  --distance DISTANCE_METRIC
                        The distance metric for the hierarchical clustering.
                        If you do not use this flag, the default distance
                        metric will be used for each clustering configuration
                        which is "euclidean". (default: None)
  --linkage LINKAGE_METHOD
                        The same story as with `--distance`, except that the
                        system default for this one is "ward". (default: None)
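
The interplay between --skip-hierarchical-clustering, --enforce-hierarchical-clustering, and the 25,000-split threshold described above can be summarized as a small decision function. This is a sketch of the documented behavior, not anvi'o's code: the function name is hypothetical, and letting an explicit skip request win over an enforce request is an assumption.

```python
# Threshold quoted in the help text for --enforce-hierarchical-clustering.
MAX_SPLITS_FOR_CLUSTERING = 25_000


def should_cluster(num_splits, skip=False, enforce=False):
    """Return True if hierarchical clustering would be attempted."""
    if skip:
        # Assumption: an explicit skip request always wins.
        return False
    if num_splits > MAX_SPLITS_FOR_CLUSTERING:
        # Above the threshold, clustering happens only when enforced.
        return enforce
    return True


for num_splits, enforce in [(12_000, False), (40_000, False), (40_000, True)]:
    print(num_splits, enforce, should_cluster(num_splits, enforce=enforce))
```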

anvi-summarize

Summarizer for anvi'o pan or profile db's. Essentially, this program takes a collection id along with either a profile database and a contigs database or a pan database and a genomes storage and generates a static HTML output for what is described in a given collection. The output directory will contain almost everything any downstream analysis may need, and can be displayed using a browser without the need for an anvi'o installation. For this reason alone, reporting summary outputs as supplementary data with publications is a great idea for transparency and reproducibility

Usage

usage: anvi-summarize [-h] -p PAN_OR_PROFILE_DB [-c CONTIGS_DB]
                      [-g GENOMES_STORAGE] [--init-gene-coverages]
                      [--reformat-contig-names]
                      [--report-aa-seqs-for-gene-calls]
                      [--report-DNA-sequences] [-C COLLECTION_NAME]
                      [-o DIR_PATH] [--list-collections]
                      [--cog-data-dir COG_DATA_DIR] [--quick-summary]
                      [--just-do-it]

Parameters

PROFILE: The profile. It could be a standard or pan profile database.

  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
                        Anvi'o pan or profile database (and even genes
                        database in appropriate contexts). (default: None)

PROFILE TYPE SPECIFIC PARAMETERS: If you are summarizing a collection stored in a standard anvi'o profile, you will need a contigs database to go with it. If you are working with a pan profile, then you will need to provide a genomes storage. Don't worry too much, because anvi'o will warn you gently if you make a mistake.

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
                        Anvi'o genomes storage file (default: None)

STANDARD PROFILE SPECIFIC PARAMS: Parameters that are only relevant to standard profile summaries (declaring or not declaring them will not change anything if you are summarizing a pan profile).

  --init-gene-coverages
                        Initialize gene coverage and detection data. This is a
                        very computationally expensive step, but it is
                        necessary when you need gene level coverage data. The
                        reason this is very computationally expensive is
                        because anvi'o computes gene coverages by going back
                        to actual coverage values of each gene to average
                        them, instead of using contig average coverage values,
                        for extreme accuracy. (default: False)
  --reformat-contig-names
                        Reformat contig names while generating the summary
                        output so they look fancy. With this flag, anvi'o will
                        replace the original names of contigs with those that
                        include the bin name as a prefix in resulting summary
                        output files per bin. Use this flag carefully as it
                        may influence your downstream analyses due to the fact
                        that your original contig names in your input FASTA
                        file for the contigs database will not be in the
                        summary output. However, anvi'o will report a
                        conversion map per bin so you can recover the original
                        contig name if you have to. (default: False)
  --report-aa-seqs-for-gene-calls
                        You can use this flag if you would like amino acid AND
                        DNA sequences for your gene calls in the genes output
                        file. By default, only DNA sequences are reported.
                        (default: False)
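
As a rough illustration of why `--init-gene-coverages` can matter, the sketch below contrasts a contig-wide average with a gene-level average computed from per-nucleotide coverage values. The function name, the toy coverage list, and the 0-based half-open gene coordinates are assumptions made for illustration; this is not anvi'o's implementation.

```python
def gene_mean_coverage(contig_coverage, start, stop):
    """Average the per-nucleotide coverage values that fall inside one gene.

    `start` and `stop` are hypothetical 0-based, half-open coordinates.
    """
    values = contig_coverage[start:stop]
    return sum(values) / len(values)


# Toy contig of 100 nt: the first half is covered 10X, the second half 100X.
contig = [10] * 50 + [100] * 50

# Contig-wide average, which hides the difference between the two halves.
contig_mean = sum(contig) / len(contig)

# Gene occupying the second half of the contig.
gene_mean = gene_mean_coverage(contig, 50, 100)

print(contig_mean, gene_mean)
```

In this toy case the contig average would understate the coverage of the gene by almost half, which is the kind of error that going back to the actual coverage values avoids.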

PAN PROFILE SPECIFIC PARAMS: Parameters that are only relevant to pan profile summaries (declaring or not declaring them will not change anything if you are summarizing a standard profile).

  --report-DNA-sequences
                        By default, this program reports amino acid sequences.
                        Use this flag to report DNA sequences instead. Also
                        note, since gene clusters are aligned via amino acid
                        sequences, using this flag removes alignment
                        information manifesting in the form of gap characters
                        (`-` characters) that would be present if amino acid
                        sequences were reported. Read the warnings during
                        runtime for more detailed information. (default:
                        False)

COMMONS: Common parameters for both pan and standard profile summaries.

  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name. (default: None)
  -o DIR_PATH, --output-dir DIR_PATH
                        Directory path for output files (default: None)
  --list-collections    Show available collections and exit. (default: False)
  --cog-data-dir COG_DATA_DIR
                        The directory path for your COG setup. Anvi'o will try
                        to use the default path if you do not specify
                        anything. (default: None)
  --quick-summary       When declared, the summary output will be generated as
                        quickly as possible, with a minimum amount of essential
                        information about bins. (default: False)

EXTRA: Extra stuff because you're extra.

  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)

anvi-summarize-blitz

FAST summary of many anvi'o single profile databases (without having to use the program anvi-merge).

Usage

usage: anvi-summarize-blitz [-h] -c CONTIGS_DB -C COLLECTION_NAME
                            [-o FILE_PATH] [--stats-to-summarize STATS]
                            SINGLE_PROFILE(S) [SINGLE_PROFILE(S) ...]

Parameters

positional arguments:

  SINGLE_PROFILE(S)     Anvi'o single profiles to summarize. All profiles
                        should be associated with the same contigs db.

optional arguments:

  -h, --help            show this help message and exit
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Name of the collection you wish to summarize. The SAME
                        collection will be summarized across all of your input
                        profiles. This collection must be defined in at least
                        the first profile in the argument list. (default:
                        None)
  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
  --stats-to-summarize STATS, -S STATS
                        Use this flag to indicate which statistics you want
                        summarized, as a comma-separated list. The default
                        stats are 'detection' and 'mean_coverage_Q2Q3'. To see
                        a list of available stats, use this flag and provide
                        an absolutely ridiculous string after it (we suggest
                        'cattywampus', but you do you). (default: None)
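
One possible way to assemble an `anvi-summarize-blitz` call from a script is sketched below. It only builds the argument list from the flags documented above and does not run anything; the helper name and every file name are hypothetical placeholders.

```python
def blitz_command(contigs_db, collection, profiles, output_file=None, stats=None):
    """Build an argument list for anvi-summarize-blitz (hypothetical helper)."""
    cmd = ["anvi-summarize-blitz", *profiles, "-c", contigs_db, "-C", collection]
    if output_file:
        cmd += ["-o", output_file]
    if stats:
        # The help text asks for a comma-separated list of statistics.
        cmd += ["--stats-to-summarize", ",".join(stats)]
    return cmd


cmd = blitz_command(
    contigs_db="CONTIGS.db",                      # placeholder path
    collection="default",                         # placeholder collection name
    profiles=["s1/PROFILE.db", "s2/PROFILE.db"],  # placeholder single profiles
    output_file="blitz.txt",
    stats=["detection", "mean_coverage_Q2Q3"],
)
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` on a system where anvi'o is installed.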

anvi-tabulate-trnaseq

A program to write standardized tab-delimited files of tRNA-seq seed coverage and modification results

Usage

usage: anvi-tabulate-trnaseq [-h] -c CONTIGS_DB --specific-profile-db
                             PROFILE_DB [--nonspecific-profile-db PROFILE_DB]
                             [-o DIR_PATH] [-W]

Parameters

MANDATORY:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Path to a `trnaseq`-variant contigs database, as
                        produced by `anvi-merge-trnaseq`. (default: None)
  --specific-profile-db PROFILE_DB, -s PROFILE_DB
                        The path to an anvi'o profile database containing
                        specific coverage information on tRNA seeds. `anvi-
                        merge-trnaseq` generates a specific profile database
                        from a tRNA-seq experiment. (default: None)

OPTIONAL:

  --nonspecific-profile-db PROFILE_DB, -n PROFILE_DB
                        The path to an anvi'o profile database containing
                        nonspecific coverage information on tRNA seeds. `anvi-
                        merge-trnaseq` optionally generates a nonspecific
                        profile database from a tRNA-seq experiment. (default:
                        None)
  -o DIR_PATH, --output-dir DIR_PATH
                        Directory path for output files (default: None)
  -W, --overwrite-output-destinations
                        Overwrite if the output files and/or directories
                        exist. (default: False)

anvi-trnaseq

A program to process reads from a tRNA-seq dataset to generate an anvi'o tRNA-seq database

Usage

usage: anvi-trnaseq [-h] [-f FASTA] [-S NAME] [-o DIR_PATH]
                    [--treatment TREATMENT] [-W] [--description TEXT_FILE]
                    [--write-checkpoints]
                    [--load-checkpoint {profile,normalize,map_fragments,substitutions,indels}]
                    [--feature-param-file FILE]
                    [--threeprime-termini THREEPRIME_TERMINI]
                    [--min-length-long-fiveprime INT]
                    [--min-trna-fragment-size INT]
                    [--agglomeration-max-mismatch-freq FLOAT]
                    [--skip-INDEL-profiling] [--max-indel-freq FLOAT]
                    [--left-indel-buffer INT] [--right-indel-buffer INT]
                    [-T NUM_THREADS] [--skip-fasta-check]
                    [--profiling-chunk-size INT]
                    [--alignment-target-chunk-size INT]
                    [--default-feature-param-file OUTPUT_FILENAME]
                    [--print-default-feature-params]

Parameters

MANDATORY:

  -f FASTA, --trnaseq-fasta FASTA
                        The FASTA file containing merged (quality-controlled)
                        tRNA-seq reads from a sample. We recommend generating
                        this file via `anvi-run-workflow -w trnaseq` to ensure
                        proper merging of read pairs that may be partially or
                        fully overlapping, and to automatically produce
                        anvi'o-compliant simple deflines. If there is a
                        problem, anvi'o will gracefully complain about it.
                        (default: None)
  -S NAME, --sample-name NAME
                        Unique sample name, considering all others in the
                        experiment, that only includes ASCII letters and
                        digits, without spaces (default: None)
  -o DIR_PATH, --output-dir DIR_PATH
                        Directory path for output files (default: None)

EXTRAS:

  --treatment TREATMENT
                        The type of treatment applied during tRNA-seq sample
                        preparation. The values which are currently known to
                        anvi'o are "untreated" and "demethylase", as tRNA-seq
                        samples are commonly split for these treatments.
                        Anvi'o will warn you if you do not choose one of these
                        known options, but it will not affect data processing.
                        Treatment type is stored for further reference in the
                        output tRNA-seq database, and can be used in anvi-
                        merge-trnaseq to affect which nucleotides are called
                        at predicted modification sites in tRNA seed
                        sequences. (default: untreated)
  -W, --overwrite-output-destinations
                        Overwrite if the output files and/or directories
                        exist. (default: False)
  --description TEXT_FILE
                        A plain text file that contains some description about
                        the project. You can use Markdown syntax. The
                        description text will be rendered and shown in all
                        relevant interfaces, including the anvi'o interactive
                        interface, or anvi'o summary outputs. (default: None)

ADVANCED:

  --write-checkpoints   Use this flag to write pickle files of intermediate
                        results at key points in anvi-trnaseq. If anvi'o
                        crashes for some reason, the argument, --load-
                        checkpoint, with the associated checkpoint name can be
                        used to restart the program from the given checkpoint.
                        This can be useful for saving time if anvi'o crashes
                        or in comparing the results of different advanced
                        program parameterizations involved in later stages of
                        the analytical pipeline after the checkpoint, such as
                        --min-trna-fragment-size and --agglomeration-max-
                        mismatch-freq. This flag will overwrite existing
                        intermediate files in the output directory as needed.
                        (default: False)
  --load-checkpoint {profile,normalize,map_fragments,substitutions,indels}
                        This option restarts `anvi-trnaseq` from the specified
                        checkpoint. It can be useful for saving time if anvi'o
                        crashed after the checkpoint. It can also be useful in
                        comparing the results of different advanced program
                        parameterizations that are only involved in stages of
                        the analytical pipeline after the checkpoint. `anvi-
                        trnaseq` must previously have been run with the flag,
                        `--write-checkpoints`, so that intermediate checkpoint
                        files were generated. Checkpoint "profile" restarts
                        after tRNA profiling. "normalize" restarts after
                        sequence trimming and normalization. "map_fragments"
                        restarts after non-3' fragments have been mapped to
                        normalized tRNA sequences. "substitutions" restarts
                        after potential modification-induced substitutions
                        have been found. "indels" restarts after modification-
                        induced indels have been found, the last step in tRNA
                        identification. If `--write-checkpoints` is used in
                        conjunction with `--load-checkpoint` then all existing
                        intermediate files from checkpoints following the one
                        being loaded will be overwritten. (default: None)
  --feature-param-file FILE
                        A .ini file can be provided to set tRNA feature
                        parameters used in de novo profiling/identification of
                        tRNA sequences from the 3' end. Generate the default
                        file with the command, `anvi-trnaseq --default-
                        feature-param-file`. Dashes in the default file show
                        parameters that cannot be changed, because they do not
                        exist or are set in stone. For instance, the program
                        only detects base pairing in stems, so only stem
                        features are parameterized with a maximum allowed
                        number of unpaired nucleotides, while every other
                        feature has a dash in the "Number allowed unpaired"
                        column. Two quotes in the default file show parameters
                        that are not currently set. To lift a constraint, a
                        parameter value can be replaced by "". For instance,
                        the conserved purine at D loop/position 21 indicated
                        by the value, 0,R, can be replaced by "" to prevent
                        the program from seeking a conserved nucleotide there.
                        Conserved nucleotides in a feature are set by pairs of
                        zero-based indices and nucleotide symbols. The index
                        indicates the conserved position in the feature,
                        relative to the 5' end of the feature. The nucleotide
                        symbol can be A, C, G, T (U in RNA), R (purine), or Y
                        (pyrimidine). The index is separated from the symbol
                        by a comma. Multiple conserved positions in a feature
                        are separated by a semicolon. Feature profiling of a
                        sequence halts when the number of allowed unconserved
                        nucleotides in a feature or the number of allowed
                        unpaired positions in a stem is exceeded. The default
                        allowed number of unconserved nucleotides in the D
                        loop, for example, is 1, so 4 of the 5 conserved
                        positions must be found for the D loop to be
                        positively identified. By default, 1 position is
                        allowed to be unpaired (no Watson-Crick or G-T wobble
                        base pair) in each of the 4 stems; the user could, for
                        instance, lift this constraint on the acceptor stem by
                        changing the value from 1 to "". There are 3 variable-
                        length sections of tRNA. The user could, for example,
                        change the allowed lengths of the V loop from a
                        discontinuous range, "4-5,9-23", to a continuous
                        range, "4-23". (default: None)
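
The value formats described for `--feature-param-file` can be made concrete with a small parser. This is a minimal sketch that assumes only what the help text states (index,symbol pairs joined by semicolons, `""` for an unset value, and comma-separated length ranges); the function names are hypothetical and the real file has more columns than are handled here.

```python
# Nucleotides accepted by each symbol; R is any purine, Y is any pyrimidine.
ACCEPTED = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT"}


def parse_conserved(value):
    """Turn '0,R;3,Y' into [(0, 'R'), (3, 'Y')]; '""' means nothing is set."""
    value = value.strip().strip('"')
    if not value:
        return []
    pairs = []
    for item in value.split(";"):
        index, symbol = item.split(",")
        pairs.append((int(index), symbol))
    return pairs


def parse_lengths(value):
    """Turn '4-5,9-23' into the set of allowed feature lengths."""
    lengths = set()
    for part in value.strip().strip('"').split(","):
        if "-" in part:
            low, high = part.split("-")
            lengths.update(range(int(low), int(high) + 1))
        else:
            lengths.add(int(part))
    return lengths


def count_unconserved(feature_seq, conserved):
    """Count conserved positions (0-based from the 5' end) that do not match."""
    return sum(feature_seq[i] not in ACCEPTED[s] for i, s in conserved)


print(parse_conserved("0,R"))             # one conserved purine at index 0
print(sorted(parse_lengths("4-5,9-23")))  # discontinuous V loop lengths
```

Under these assumptions, replacing `"4-5,9-23"` with `"4-23"` simply fills in lengths 6 through 8, and replacing `0,R` with `""` yields an empty list of conserved positions.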
  --threeprime-termini THREEPRIME_TERMINI
                        Termini represent the subsequences (in the 5'->3'
                        orientation) to expect at the 3' end of a tRNA read
                        adjacent to the discriminator nucleotide. tRNA feature
                        profiling from the 3' end seeks a valid terminus prior
                        to the discriminator and more 5' features. 3' terminal
                        sequences can include the nucleotides, A, C, G, and T,
                        and N, symbolizing any nucleotide. A single
                        underscore, "_", can be included in lieu of a
                        sequence, symbolizing the absence of a terminus such
                        that the tRNA feature profile may end with the
                        discriminator. If "_" is not included, tRNA sequences
                        ending in the discriminator will still be sought as
                        *fragments* of profiled tRNA. The order of sequences
                        in the argument is the order of consideration in
                        profiling. For example, if CCA is the first 3'
                        terminus considered, and it produces a complete
                        profile with no unconserved or unpaired nucleotides,
                        then the other possible termini are not considered.
                        Other termini are only considered with the possibility
                        of "improvement" in the feature profile. (default:
                        CCA,CC,C,CCAN,CCANN)
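
The order-of-consideration rule for `--threeprime-termini` can be sketched as follows. This is a deliberately simplified model: it only checks which terminus in the list first matches the 3' end of a read, treating `N` as a wildcard and `_` as the absence of a terminus. It does not model the feature profile or the "improvement" logic described above, and the function names are hypothetical.

```python
def terminus_matches(read, terminus):
    """Check whether `read` ends with `terminus`, where N matches any base."""
    if terminus == "_":
        return True  # "_" stands for the absence of a terminus
    if len(read) < len(terminus):
        return False
    tail = read[-len(terminus):]
    return all(t == "N" or t == r for t, r in zip(terminus, tail))


def first_matching_terminus(read, termini="CCA,CC,C,CCAN,CCANN"):
    """Return the first terminus, in argument order, found at the 3' end."""
    for terminus in termini.split(","):
        if terminus_matches(read, terminus):
            return terminus
    return None


for read in ("GGGACCA", "GGGACC", "GGGACCAG", "GGGATTT"):
    print(read, first_matching_terminus(read))
```

With the default order, a read ending in CCA is resolved by the first candidate, whereas a read ending in CCAG is only matched once CCAN is reached.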
  --min-length-long-fiveprime INT
                        tRNA reads often extend beyond the 5' end of a mature
                        tRNA sequence. This can be biological in origin when
                        the read is from pre-tRNA; artifactual in origin when
                        the reverse transcriptase runs off the end of the
                        template, adding a small number of random bases; or
                        artifactual when the read is a chimera of tRNA at the
                        3' end and another, potentially non-tRNA, transcript
                        at the 5' end. Longer 5' extensions are more likely to
                        be biological than artifactual due to the exclusion of
                        runoff bases. This parameter sets the minimum length
                        of 5' sequence extensions that are recorded in the
                        tRNA-seq database output for further analysis.
                        (default: 4)
  --min-trna-fragment-size INT
                        Anvi'o profiles a sequence as tRNA by identifying tRNA
                        features from the 3' end of the sequence. tRNA-seq
                        datasets can include a significant number of tRNA
                        fragments that are not from the 3' end of the sequence
                        ending in a recognized terminus, e.g., CCA. These
                        "interior" and 5' fragments can be of significant
                        biological interest. Fragments are identified by
                        mapping unprofiled reads to profiled tRNAs that have
                        their 3' termini trimmed off. This parameter sets the
                        minimum length of unprofiled reads searched in this
                        manner. The choice of 25 as the default value is
                        motivated by considerations of false positive matches
                        and computational performance with a shorter minimum
                        sequence length. Since unprofiled reads are mapped to
                        every unique profiled tRNA sequence, a shorter minimum
                        sequence length can make mapping take a very long time
                        and return too many alignments to store in memory for
                        datasets of millions of reads. Pay attention to Python
                        memory usage if you adjust this parameter downwards.
                        (default: 25)
  --agglomeration-max-mismatch-freq FLOAT
                        Anvi'o finds potential tRNA modifications by first
                        agglomerating sequences differing from one or more
                        other sequences in the cluster by mismatches at a
                        certain fraction of nucleotides. This parameter sets
                        the maximum mismatch fraction that is allowed, by
                        default 0.03. The value of this parameter is rounded
                        to the nearest hundredth. The default approximates
                        2/71, representing 2 mismatches in a full-length tRNA
                        of length 74. The denominator is 71 rather than 74
                        because 3' sequence variants, including the canonical
                        3'-CCA, are trimmed off prior to sequences being
                        agglomerated. (Non-mitochondrial tRNAs typically range
                        in length from 74-95.) For
                        example, consider 3 trimmed sequences of length 71 --
                        A, B and C -- and 1 sequence of length 65, D. If A
                        differs from B by a substitution at position 1 when
                        aligned (mismatch frequency of 0.014), and C differs
                        from B at positions 10 and 20 (mismatch frequency of
                        0.028), such that C differs from A by 3 substitutions
                        (mismatch frequency of 0.042), then A, B, and C will
                        still agglomerate into a single cluster, as each
                        differs by no more than 2 substitutions from *some
                        other sequence* in the cluster. In contrast, sequence
                        D differs from B at positions 30 and 40 (mismatch
                        frequency of 0.031), exceeding the 0.03 limit needed
                        to agglomerate. D forms its own cluster and is not
                        consolidated into a single modified sequence with the
                        others. (default: 0.03)
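
The A, B, C, D example above can be reproduced with a small single-linkage clustering sketch. Several things here are assumptions for illustration only: sequences are compared by aligning their 3' ends, the mismatch frequency is computed over the length of the shorter sequence, and positions are 0-based. Anvi'o's actual alignment and agglomeration code is more involved.

```python
from itertools import combinations


def mismatch_freq(a, b):
    """Fraction of mismatches when a and b are aligned at their 3' ends."""
    n = min(len(a), len(b))
    return sum(x != y for x, y in zip(a[-n:], b[-n:])) / n


def agglomerate(seqs, max_freq=0.03):
    """Single-linkage clustering: link any pair within the mismatch limit."""
    parent = {name: name for name in seqs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for p, q in combinations(seqs, 2):
        if mismatch_freq(seqs[p], seqs[q]) <= max_freq:
            parent[find(p)] = find(q)

    clusters = {}
    for name in seqs:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


def substitute(seq, *positions):
    """Introduce a substitution at each given 0-based position."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    chars = list(seq)
    for i in positions:
        chars[i] = swap[chars[i]]
    return "".join(chars)


B = ("ACGT" * 18)[:71]              # toy trimmed sequence of length 71
A = substitute(B, 1)                # 1 substitution relative to B
C = substitute(B, 10, 20)           # 2 substitutions relative to B
D = substitute(B, 30, 40)[-65:]     # length 65, 2 substitutions relative to B

clusters = agglomerate({"A": A, "B": B, "C": C, "D": D})
print(clusters)
```

A and C are never linked directly (3/71 exceeds the limit), but both link to B, so the three end up in one cluster, while D (2/65) stays on its own.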
  --skip-INDEL-profiling
                        This flag prevents the prediction of modification-
                        induced indels in tRNA reads, which can save time.
                        (default: False)
  --max-indel-freq FLOAT
                        The maximum indel frequency constrains the number and
                        length of modification-induced indels that can be
                        found. The value of this parameter is rounded to the
                        nearest hundredth. Anvi'o identifies tRNAs with
                        potential modification-induced substitutions before
                        finding indels. tRNAs with substitutions are aligned
                        with other sequences to find sequences differing only
                        by indels. The default parameter value of 0.05 allows
                        1 indel of length 3 to be found in a modified sequence
                        of length 71. (Modified sequences have the canonical
                        3'-CCA trimmed off, so a sequence of length 71
                        represents the low end of the non-mitochondrial tRNA
                        length range of 74-95.) The default equivalently
                        allows 2 indels of lengths 1 and 2 or 3 indels of
                        length 1 in a sequence of length 71. An indel of
                        length 4 would result in a frequency of 0.056 and so
                        would not be considered. (default: 0.05)
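
The arithmetic behind the `--max-indel-freq` examples can be checked with a few lines. The sketch assumes, as the examples above imply, that the frequency is the summed indel length divided by the sequence length; the function names are hypothetical.

```python
def indel_frequency(indel_lengths, seq_length):
    """Total indel length divided by the length of the modified sequence."""
    return sum(indel_lengths) / seq_length


def indels_allowed(indel_lengths, seq_length, max_freq=0.05):
    """True if the combined indels stay within the maximum indel frequency."""
    return indel_frequency(indel_lengths, seq_length) <= max_freq


# Cases from the help text, all for a trimmed sequence of length 71.
for indels in ([3], [1, 2], [1, 1, 1], [4]):
    freq = indel_frequency(indels, 71)
    print(indels, round(freq, 3), indels_allowed(indels, 71))
```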
  --left-indel-buffer INT
                        This parameter sets the distance an indel must lie
                        from the left end of a sequence alignment in the
                        search for modification-induced indels. The default
                        buffer of 3 matches was chosen to prevent nontemplated
                        and variant nucleotides at the 5' end of tRNA reads
                        from being mistakenly identified as indels. (default:
                        3)
  --right-indel-buffer INT
                        This parameter sets the distance an indel must lie
                        from the right end of a sequence alignment in the
                        search for modification-induced indels. The default
                        buffer of 3 matches was chosen to prevent variant
                        nucleotides at the 3' end of tRNA reads from being
                        mistakenly identified as indels. (default: 3)

PERFORMANCE:

  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on an
                        SGE cluster. If you are clusterizing your runs and
                        asking for multiple threads, you may deplete your
                        resources very fast. (default: 1)
  --skip-fasta-check    Don't check the input FASTA file for such things as
                        proper defline formatting to speed things up.
                        (default: False)
  --profiling-chunk-size INT
                        Anvi'o manages memory consumption during tRNA feature
                        profiling by chunking the unique input sequences. This
                        parameter sets the maximum number of sequences in each
                        chunk. Adjustment of this parameter has little effect
                        on speed. (default: 100000)
  --alignment-target-chunk-size INT
                        Anvi'o sequence alignment manages memory consumption
                        by chunking the list of alignment targets, so that
                        queries are aligned to the first chunk of targets,
                        then the second chunk, and so on. This parameter sets
                        the maximum number of target sequences in each chunk.
                        Memory management becomes important when aligning
                        short queries to a large number of targets, which
                        involves searching queries against a massive number of
                        k-mers (equal in length to the shortest query) that
                        have been extracted from targets. Adjust this
                        parameter downward if your system runs out of memory
                        during alignment; adjust this parameter upward to
                        speed up alignment if you find that you are not
                        memory-limited. Ideally, we would set this parameter
                        using a heuristic function parameterized with the
                        numbers and lengths of query and target sequences...
                        (default: 25000)
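
The chunking idea behind `--alignment-target-chunk-size` can be sketched as below. Exact substring matching stands in for the real k-mer based aligner, and the function names are hypothetical; the point is only that all queries are searched against one chunk of targets at a time, so the amount of target data held per pass is bounded by the chunk size.

```python
def chunk_targets(targets, chunk_size=25000):
    """Yield successive slices of `targets` with at most `chunk_size` items."""
    for start in range(0, len(targets), chunk_size):
        yield targets[start:start + chunk_size]


def align_in_chunks(queries, targets, chunk_size):
    """Search every query against one chunk of targets at a time."""
    hits = []
    for chunk in chunk_targets(targets, chunk_size):
        for query in queries:
            for target in chunk:
                if query in target:  # stand-in for a real alignment step
                    hits.append((query, target))
    return hits


targets = ["GGCCA", "TTT", "ACCAT"]
print(list(chunk_targets(targets, 2)))
print(align_in_chunks(["CCA"], targets, 2))
```

A smaller chunk size means more passes over the queries with less held in memory per pass; a larger one means fewer passes at a higher memory cost.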

DEFAULTS:

  --default-feature-param-file OUTPUT_FILENAME
                        Writes a tab-delimited .ini file containing default
                        tRNA feature parameterizations used in de novo
                        profiling/identification of tRNA sequences from the 3'
                        end. Parameters can be modified by the user and the
                        file fed back into anvi-trnaseq through the --feature-
                        param-file argument, the help description of which
                        explains the file format. (default: None)
  --print-default-feature-params
                        Prints to standard output a nicely formatted table of
                        the default tRNA feature parameterizations (which can
                        be written to a tab-delimited .ini file by the option,
                        --default-feature-param-file). (default: False)

anvi-update-db-description

Update the description in an anvi'o database

Usage

usage: anvi-update-db-description [-h] --description TEXT_FILE DB

Parameters

positional arguments:

  DB                    An anvi'o database.

optional arguments:

  -h, --help            show this help message and exit
  --description TEXT_FILE
                        A plain text file that contains some description about
                        the project. You can use Markdown syntax. The
                        description text will be rendered and shown in all
                        relevant interfaces, including the anvi'o interactive
                        interface, or anvi'o summary outputs. (default: None)

anvi-update-structure-database

Add genes to, or re-run genes in, an already existing structure database. All settings used to generate your database will be used in this program.

Usage

usage: anvi-update-structure-database [-h] -c CONTIGS_DB -s STRUCTURE_DB
                                      [--genes-of-interest FILE]
                                      [--gene-caller-ids GENE_CALLER_IDS]
                                      [--external-structures FILE_PATH]
                                      [--dump-dir DUMP_DIR]
                                      [--list-modeller-params] [--rerun-genes]
                                      [--modeller-executable MODELLER_EXECUTABLE]
                                      [-T NUM_THREADS]
                                      [--write-buffer-size-per-thread INT]

Parameters

DATABASES: Declaring relevant anvi'o databases. First things first.

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -s STRUCTURE_DB, --structure-db STRUCTURE_DB
                        Anvi'o structure database. (default: None)

GENES: Specify which genes you want to be modelled. If a gene already exists in the DB, it will be overwritten if --rerun-genes is set. Otherwise, an error will be raised.

  --genes-of-interest FILE
                        A file with anvi'o gene caller IDs. There should be
                        only one column in the file, and each line should
                        correspond to a unique gene caller id (without a
                        column header). (default: None)
  --gene-caller-ids GENE_CALLER_IDS
                        Gene caller ids. Multiple of them can be declared
                        separated by a delimiter (the default is a comma). In
                        anvi-gen-variability-profile, if you declare nothing
                        you will get all genes matching your other filtering
                        criteria. In other programs, you may get everything,
                        nothing, or an error. It really depends on the
                        situation. Fortunately, mistakes are cheap, so it's
                        worth a try. (default: None)
  --external-structures FILE_PATH
                        A two-column TAB-delimited flat text file that lists
                        PDB protein structures. The first item in the header
                        line should read 'gene_callers_id', and the second
                        should read 'path'. Each line in the file should
                        describe a single entry, where the first column is the
                        gene_callers_id that the structure corresponds to, and
                        the second column is the path to the structure file.
                        (default: None)
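
A file in the layout described for `--external-structures` could be produced along these lines. The header names come from the help text above; the helper name, gene caller IDs, and structure paths are made up for illustration.

```python
import csv
import io


def write_external_structures(rows, handle):
    """Write a two-column TAB-delimited file with the expected header."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene_callers_id", "path"])
    for gene_callers_id, path in rows:
        writer.writerow([gene_callers_id, path])


# Hypothetical gene caller IDs and PDB paths.
rows = [(1, "structures/gene_1.pdb"), (2, "structures/gene_2.pdb")]

buffer = io.StringIO()  # swap for open("external-structures.txt", "w") to write a file
write_external_structures(rows, buffer)
text = buffer.getvalue()
print(text)
```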

OUTPUT: Output file and output style.

  --dump-dir DUMP_DIR   Modeling and annotating structures requires a lot of
                        moving parts, each of which has its own outputs. The
                        output of this program is a structure database
                        containing the pertinent results of this computation;
                        however, a lot of stuff doesn't make the cut. By
                        providing a directory for this parameter you will get,
                        in addition to the structure database, a directory
                        containing the raw output for everything. (default:
                        None)

MODELLER PARAMS: Parameters for MODELLER's homology modeling.

  --list-modeller-params
                        Since you are updating an existing DB, modeller params
                        are set in place. You can have this program list them
                        by providing this flag (default: False)

EXTRA: Everything else.

  --rerun-genes         Supply if you would like to rerun structural modelling
                        for your genes of interest if they are already present
                        in your DB (default: False)
  --modeller-executable MODELLER_EXECUTABLE
                        The MODELLER program to use. For example, `mod9.19`.
                        Anvi'o will try and find it if not provided. (default:
                        None)
  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: 1)
  --write-buffer-size-per-thread INT
                        How many items should be kept in memory before they
                        are written to the disk. The default is 25 per thread.
                        So a single-threaded job would have a write buffer
                        size of 25, whereas a job with 4 threads would have a
                        write buffer size of 4*25. The larger the buffer size,
                        the less frequently the program will access the disk,
                        yet the more memory will be consumed since the
                        processed items will be cleared off the memory only
                        after they are written to the disk. The default buffer
                        size will likely work for most cases. Please keep an
                        eye on the memory usage output to make sure the memory
                        use never exceeds the size of the physical memory. If
                        --num-threads is 1, this parameter is ignored because
                        the DB is written to after each gene

anvi-upgrade

Download and install minor releases of anvi'o from a Github repository

Usage

usage: anvi-upgrade [-h] [--repository REPOSITORY]

Parameters

optional arguments:

  --repository REPOSITORY
                        Source repository to download releases, currently only
                        Github is supported. Enter in 'merenlab/anvio' format.
                        (default: merenlab/anvio)

anvi-script-add-default-collection

A script to add a 'DEFAULT' collection in an anvi'o pan or profile database with a bin named 'EVERYTHING' that describes all items available in the profile database

Usage

usage: anvi-script-add-default-collection [-h] -p PAN_OR_PROFILE_DB
                                          [-c CONTIGS_DB] [-b BIN_NAME]
                                          [-C COLLECTION_NAME]

Parameters

optional arguments:

  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB
                        Anvi'o pan or profile database (and even genes
                        database in appropriate contexts). (default: None)
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -b BIN_NAME, --bin-id BIN_NAME
                        Name for the new bin. If you don't provide any then it
                        will be named "EVERYTHING". (default: EVERYTHING)
  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Name for the new collection. If you don't provide any
                        then it will be named "DEFAULT". (default: DEFAULT)

anvi-script-augustus-output-to-external-gene-calls

Takes in gene calls by AUGUSTUS v3.3.3, generates an anvi'o external gene calls file. It may work well with other versions of AUGUSTUS, too. It is just that no one has tested the script with different versions of the program

Usage

usage: anvi-script-augustus-output-to-external-gene-calls [-h] -i INPUT_FILE
                                                          [-o FILE_PATH]
                                                          [--just-do-it]

Parameters

optional arguments:

  -i INPUT_FILE, --input-file INPUT_FILE
                        Gene calls file from AUGUSTUS (that ends with .gff).
                        Please note that the script is only tested with
                        AUGUSTUS v3.3.3 output (although it may still work
                        with other versions of the program). (default: None)
  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)

anvi-script-checkm-tree-to-interactive

A helper script to convert CheckM trees into anvi'o interactive with taxonomy information

Usage

usage: anvi-script-checkm-tree-to-interactive [-h] -t CHECKM TREE -o DIRECTORY

Parameters

optional arguments:

  -t CHECKM TREE, --tree CHECKM TREE
                        Tree file generated by CheckM. (default: None)
  -o DIRECTORY, --output-dir DIRECTORY
                        The directory in which output files will be stored.
                        (default: None)

anvi-script-compute-ani-for-fasta

Run ANI between contigs in a single FASTA file

Usage

usage: anvi-script-compute-ani-for-fasta [-h] -f FASTA file -o DIR_PATH
                                         [-p PAN_DB] [-T NUM_THREADS]
                                         [--log-file FILE_PATH]
                                         [--method {ANIm,ANIb,ANIblastall,TETRA}]
                                         [--distance DISTANCE_METRIC]
                                         [--linkage LINKAGE_METHOD]
                                         [--just-do-it]

Parameters

optional arguments:

  -f FASTA file, --fasta-file FASTA file
                        A FASTA-formatted input file. (default: None)
  -o DIR_PATH, --output-dir DIR_PATH
                        Directory path for output files (default: None)
  -p PAN_DB, --pan-db PAN_DB
                        Anvi'o pan database (default: None)
  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: 1)
  --log-file FILE_PATH  File path to store debug/output messages. (default:
                        None)
  --method {ANIm,ANIb,ANIblastall,TETRA}
                        Method for pyANI. The default is ANIb. You must have
                        the necessary binary in path for whichever method you
                        choose. According to the pyANI help for v0.2.7 at
                        https://github.com/widdowquinn/pyani, the method
                        'ANIm' uses MUMmer (NUCmer) to align the input
                        sequences. 'ANIb' uses BLASTN+ to align 1020nt
                        fragments of the input sequences. 'ANIblastall' uses
                        the legacy BLASTN to align 1020nt fragments. Finally,
                        'TETRA' calculates tetranucleotide frequencies of
                        each input sequence.
  --distance DISTANCE_METRIC
                        The distance metric for the hierarchical clustering.
                        The default is "euclidean".
  --linkage LINKAGE_METHOD
                        The linkage method for the hierarchical clustering.
                        The default is "ward".
  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)
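
The 'TETRA' option above is described as calculating tetranucleotide frequencies of each input sequence. As a rough illustration of what such a frequency profile is, here is a minimal Python sketch. The function name `tetranucleotide_frequencies` is hypothetical, and this is not pyANI's code: pyANI's own implementation may differ in normalization, strand handling, and downstream statistics.

```python
from collections import Counter


def tetranucleotide_frequencies(sequence: str) -> dict:
    """Relative frequency of each overlapping 4-mer in a sequence.

    Illustrative only: windows containing characters other than
    A, C, G, T are skipped (an assumption made for this sketch).
    """
    seq = sequence.upper()
    valid = set("ACGT")
    counts = Counter(
        seq[i:i + 4]
        for i in range(len(seq) - 3)
        if set(seq[i:i + 4]) <= valid
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


if __name__ == "__main__":
    for kmer, freq in sorted(tetranucleotide_frequencies("ACGTACGT").items()):
        print(kmer, freq)
```

Two sequences with similar profiles of this kind have similar tetranucleotide composition, which is the signal a TETRA-style comparison relies on.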

anvi-script-estimate-genome-size

A program to estimate the size of the actual population genome to which a MAG belongs

Usage

usage: anvi-script-estimate-genome-size [-h] -c CONTIGS_DB [--verbose]

Parameters

MANDATORY INPUT: An anvi'o contigs database that hopefully contains a MAG.

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)

PARAMETERS OF CONVENIENCE: Because life is already very hard as it is.

  --verbose             Be verbose, print more messages whenever possible. You
                        may regret this. (default: False)

anvi-script-filter-fasta-by-blast

Filter FASTA file according to BLAST table (remove sequences with bad BLAST alignment)

Usage

usage: anvi-script-filter-fasta-by-blast [-h] [-f FASTA file] [-o FILE_PATH]
                                         -b TAB DELIMITED FILE -s OUTFMT -t
                                         THRESHOLD [--just-do-it]

Parameters

optional arguments:

  -f FASTA file, --fasta-file FASTA file
                        A FASTA-formatted input file. (default: None)
  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
  -b TAB DELIMITED FILE, --blast-output TAB DELIMITED FILE
                        BLAST table generated with blastp. `--outfmt 6` as the
                        output format is assumed. (default: None)
  -s OUTFMT, --outfmt OUTFMT
                        Specify the column ordering of your BLAST report. We
                        add the following parameter to our BLAST searches so
                        the output report contains the `qlen` field, which is
                        not included by default: `-outfmt '6 qseqid sseqid
                        pident length mismatch gapopen qstart qend sstart send
                        evalue bitscore qlen slen'`. You may have used a
                        different `-outfmt` parameter, and you should use this
                        parameter to explicitly define the column names in
                        your output file. For instance, if you had used the
                        parameter mentioned above, then the correct version of
                        this parameter would be: "qseqid sseqid pident length
                        mismatch gapopen qstart qend sstart send evalue
                        bitscore qlen slen". Regardless of the BLAST output
                        format, your columns MUST contain the following
                        parameters for this program to work properly:
                        'qseqid', 'bitscore', 'length', 'qlen', and 'pident'.
                        (default: None)
  -t THRESHOLD, --threshold THRESHOLD
                        What `proper_pident` threshold do you want to use for
                        filtering out sequences whose top bit-score matches
                        have `proper_pident`s less than this threshold? We
                        have defined `proper_pident` to be the percentage of
                        the query amino acids that both aligned to and were
                        identical to the corresponding matched amino acid.
                        Note that the `pident` parameter output by BLAST does
                        not include regions of the query sequence unaligned to
                        the matched sequence, whereas `proper_pident` does.
                        For example, a sequence that's only half aligned by a
                        match but with 100% identity at matched regions has a
                        `pident` of 100 but a `proper_pident` of 50. The
                        default is 30.0%.
  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)
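
The difference between `pident` and `proper_pident` described above can be worked through in a few lines of Python. This is a minimal sketch, not the script's own code: the formula `pident * length / qlen` is an assumption that reproduces the half-aligned example in the help text, and the function names are hypothetical.

```python
def proper_pident(pident: float, length: int, qlen: int) -> float:
    """Percent of the whole query that is both aligned and identical.

    Assumption for this sketch: proper_pident = pident * length / qlen,
    where `length` is the BLAST alignment length and `qlen` the query length.
    """
    if qlen <= 0:
        raise ValueError("qlen must be a positive integer")
    return pident * length / qlen


def keep_sequence(pident: float, length: int, qlen: int,
                  threshold: float = 30.0) -> bool:
    """True if the top hit's proper_pident is not below the threshold."""
    return proper_pident(pident, length, qlen) >= threshold


if __name__ == "__main__":
    # A 100-residue query, half of it aligned at 100% identity:
    # pident is 100, proper_pident is 50.
    print(proper_pident(100.0, 50, 100))
    print(keep_sequence(100.0, 50, 100))
```

This is also why the `qlen` column has to be present in the BLAST report: without the query length, unaligned regions of the query cannot be taken into account.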

anvi-script-filter-hmm-hits-table

Filter weak HMM hits from a given contigs database using a domain hits table reported by anvi-run-hmms.

Usage

usage: anvi-script-filter-hmm-hits-table [-h] -c CONTIGS_DB
                                         [--hmm-source SOURCE NAME] [-l]
                                         [--domain-hits-table PATH]
                                         [--target-coverage TARGET_COVERAGE]
                                         [--query-coverage QUERY_COVERAGE]

Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  --hmm-source SOURCE NAME
                        Use a specific HMM source. You can use '--list-hmm-
                        sources' flag to see a list of available resources.
                        The default is 'None'.
  -l, --list-hmm-sources
                        List available HMM sources in the contigs database and
                        quit. (default: False)
  --domain-hits-table PATH
                        Please provide the path to the domain-table-output.
                        You can get this file from running anvi-run-hmms with
                        the flag --domain-hits-table. (default: None)
  --target-coverage TARGET_COVERAGE
                        (ali_coord_to - ali_coord_from)/target_length
                        (default: None)
  --query-coverage QUERY_COVERAGE
                        (hmm_coord_to - hmm_coord_from)/hmm_length (default:
                        None)
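
The two coverage formulas above can be written out directly. A minimal Python sketch follows. The function names are hypothetical, the coordinate names are the ones used in the help text, and how the script compares these values to the cutoffs you provide is not shown here.

```python
def target_coverage(ali_coord_from: int, ali_coord_to: int,
                    target_length: int) -> float:
    """(ali_coord_to - ali_coord_from) / target_length"""
    return (ali_coord_to - ali_coord_from) / target_length


def query_coverage(hmm_coord_from: int, hmm_coord_to: int,
                   hmm_length: int) -> float:
    """(hmm_coord_to - hmm_coord_from) / hmm_length"""
    return (hmm_coord_to - hmm_coord_from) / hmm_length


if __name__ == "__main__":
    # A hit aligned over positions 10..110 of a 200-residue gene,
    # covering positions 5..80 of a 100-position HMM (made-up numbers).
    print(target_coverage(10, 110, 200))
    print(query_coverage(5, 80, 100))
```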

anvi-script-fix-homopolymer-indels

Corrects homopolymer-region associated INDELs in a given genome based on a reference genome. The most effective use of this script is when the input genome is a genome reconstructed by MinION long reads, and the reference genome is one that is of high quality. Essentially, this script will BLAST the genome you wish to correct against the reference genome you provide, identify INDELs in the BLAST results that are exclusively associated with homopolymer regions, and will take the reference genome as a guide to correct the input sequences, and report a new FASTA file. You can use the output FASTA file that is fixed as the input FASTA file over and over again to see if you can eliminate all homopolymer-associated INDELs

Usage

usage: anvi-script-fix-homopolymer-indels [-h] [-i FASTA] [-r FASTA]
                                          [-o FASTA]
                                          [--min-homopolymer-length INT]
                                          [--verbose] [-T NUM_THREADS]
                                          [--test-run]

Parameters

FILES THAT MATTER: Here you provide file paths for input sequence(s) to be corrected, reference sequence(s) to be used for correction, and the edited input sequences to be stored. UNLESS you just want to do a test run, in which case you don't need any of these, only the parameter --test-run

  -i FASTA, --input-fasta FASTA
                        A FASTA file of sequences you wish to fix (default:
                        None)
  -r FASTA, --reference-fasta FASTA
                        A FASTA file for reference sequences (default: None)
  -o FASTA, --output-fasta FASTA
                        Corrected FASTA file (default: None)

STUFF UNDER THE HOOD: Like how should we define homopolymers or how much information should we share with you as we go.

  --min-homopolymer-length INT
                        Minimum number of identical nucleotides next to each
                        other PLUS THE GAP CHARACTER to be considered a
                        homopolymer when these nucleotides are aligned to a
                        region in the other sequence that is all composed of
                        the same nucleotides. Confused? Read on. The default
                        is 3. So when this value is 2, the program would
                        consider the following to match the definition of
                        minimum homopolymer length to be considered for
                        fixing: (R)eference: 'AA-' and (Q)uery: 'AAA'. The
                        same would be true for R: 'AA---' / Q: 'AAAAA' but not
                        R: 'A-' / Q: 'AA'. In contrast, when this value is 3,
                        then the minimum that would work would be this: R:
                        'AAA-', Q: 'AAAA'. Obviously, you shouldn't go any
                        lower than 2, but then why should you listen to a
                        computer?
  --verbose             Be verbose, print more messages whenever possible. You
                        may regret this. (default: False)
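
The --min-homopolymer-length examples above can be checked with a small Python sketch. It encodes one reading of the help text: the gapped side of the aligned pair must carry at least the given number of identical nucleotides, and everything else in the pair must be that same nucleotide. The function name is hypothetical, and the script's own logic (for example whether gaps on the query side are treated the same way) may differ.

```python
GAP = "-"


def is_fixable_homopolymer(reference: str, query: str,
                           min_homopolymer_length: int = 3) -> bool:
    """Check an aligned (reference, query) pair against the definition above."""
    if not reference or len(reference) != len(query):
        return False
    # Exactly one side of the aligned pair is expected to carry gaps.
    gapped, full = (reference, query) if GAP in reference else (query, reference)
    if GAP not in gapped or GAP in full:
        return False
    # Both sides must be built from one and the same nucleotide.
    nucleotides = set(full) | (set(gapped) - {GAP})
    if len(nucleotides) != 1:
        return False
    # Count the nucleotides on the gapped side, gaps excluded.
    return len(gapped.replace(GAP, "")) >= min_homopolymer_length


if __name__ == "__main__":
    print(is_fixable_homopolymer("AA-", "AAA", 2))      # example from the help
    print(is_fixable_homopolymer("A-", "AA", 2))        # too short for 2
    print(is_fixable_homopolymer("AAA-", "AAAA", 3))    # minimum case for 3
```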

PERFORMANCE: For the BLAST search

  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: 1)

TEST RUN: To have an idea about what is corrected and what is not

  --test-run            Just do a test run and nothing more. (default: False)

anvi-script-gen-CPR-classifier

Train a classifier for CPR prediction

Usage

usage: anvi-script-gen-CPR-classifier [-h] [-o CLASSIFIER_FILE] MATRIX_FILE

Parameters

positional arguments:

  MATRIX_FILE           TAB-delimited matrix of CPR genome names, classes, and
                        presence/absence of single-copy genes. Headers of the
                        first two rows should be "genome", and "class". The
                        rest of the rows should be single-copy genes.

optional arguments:

  -h, --help            show this help message and exit
  -o CLASSIFIER_FILE, --output CLASSIFIER_FILE
                        Output file name for the classifier. (default: cpr-
                        scg.classifier)

anvi-script-gen-distribution-of-genes-in-a-bin

Quantify the detection of genes in genomes in metagenomes to identify the environmental core. This is a helper script for anvi'o metapangenomic workflow

Usage

usage: anvi-script-gen-distribution-of-genes-in-a-bin [-h] -c CONTIGS_DB
                                                      [-p PROFILE_DB]
                                                      [-C COLLECTION_NAME]
                                                      [-b BIN_NAME]
                                                      [--min-detection FLOAT]
                                                      [--fraction-of-median-coverage FLOAT]

Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name. (default: None)
  -b BIN_NAME, --bin-id BIN_NAME
                        Bin name you are interested in. (default: None)
  --min-detection FLOAT
                        For this entire thing to work, the genome you are
                        focusing on should be detected in at least one
                        metagenome. If that is not the case, it would mean
                        that you do not have any sample that represents the
                        niche for this organism (or you do not have enough
                        depth of coverage) to investigate the detection of
                        genes in the environment. By default, this script
                        requires at least '0.5' of the genome to be detected
                        in at least one metagenome. This parameter allows you
                        to change that. 0 would mean no detection test
                        required, 1 would mean the entire genome must be
                        detected. (default: 0.5)
  --fraction-of-median-coverage FLOAT
                        The value set here will be used to remove a gene if
                        its total coverage across environments is less than
                        the median coverage of all genes multiplied by this
                        value. The default is 0.25, which means, if the median
                        total coverage of all genes across all samples is
                        100X, then, a gene with a total coverage of less than
                        25X across all samples will be assumed not a part of
                        the 'environmental core'. (default: 0.25)
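
The --fraction-of-median-coverage rule above amounts to a single cutoff. The Python sketch below assumes the total coverage of each gene across all samples has already been summed; the function name and the gene names are hypothetical, and this is an illustration of the rule as described, not the script's code.

```python
from statistics import median


def environmental_core(total_coverage_per_gene: dict,
                       fraction_of_median_coverage: float = 0.25) -> set:
    """Genes whose total coverage is not below median * fraction."""
    cutoff = median(total_coverage_per_gene.values()) * fraction_of_median_coverage
    return {
        gene
        for gene, coverage in total_coverage_per_gene.items()
        if coverage >= cutoff
    }


if __name__ == "__main__":
    # Median total coverage is 100X, so the cutoff is 25X and gene_4 (20X)
    # is assumed not to be a part of the environmental core.
    coverages = {"gene_1": 100, "gene_2": 120, "gene_3": 80,
                 "gene_4": 20, "gene_5": 100}
    print(sorted(environmental_core(coverages)))
```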

anvi-script-gen-functions-per-group-stats-output

Generate a TAB delimited file for the distribution of functions across groups of genomes/metagenomes

Usage

usage: anvi-script-gen-functions-per-group-stats-output [-h] [-i FILE_PATH]
                                                        [-e FILE_PATH]
                                                        [-g GENOMES_STORAGE]
                                                        --annotation-source
                                                        SOURCE NAME
                                                        [--aggregate-based-on-accession]
                                                        [--aggregate-using-all-hits]
                                                        [--min-occurrence NUM GENOMES]
                                                        [-G TEXT_FILE]
                                                        [-o FILE_PATH]

Parameters

GENOMES: Tell anvi'o where your genomes are.

  -i FILE_PATH, --internal-genomes FILE_PATH
                        A five-column TAB-delimited flat text file. The header
                        line must contain these columns: 'name', 'bin_id',
                        'collection_id', 'profile_db_path', 'contigs_db_path'.
                        Each line should list a single entry, where 'name' can
                        be any name to describe the anvi'o bin identified as
                        'bin_id' that is stored in a collection. (default:
                        None)
  -e FILE_PATH, --external-genomes FILE_PATH
                        A two-column TAB-delimited flat text file that lists
                        anvi'o contigs databases. The first item in the header
                        line should read 'name', and the second should read
                        'contigs_db_path'. Each line in the file should
                        describe a single entry, where the first column is the
                        name of the genome (or MAG), and the second column is
                        the anvi'o contigs database generated for this genome.
                        (default: None)
  -g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
                        Anvi'o genomes storage file (default: None)
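
The external genomes file described above is a plain two-column TAB-delimited file, so it is easy to write by hand or from a script. The Python sketch below writes one with the standard library. The function name, genome names, and paths are hypothetical placeholders. Only the header items 'name' and 'contigs_db_path' come from the help text.

```python
import csv
import io


def write_external_genomes(entries, handle) -> None:
    """Write (name, contigs_db_path) pairs as a TAB-delimited file."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["name", "contigs_db_path"])  # header from the help text
    for name, contigs_db_path in entries:
        writer.writerow([name, contigs_db_path])


if __name__ == "__main__":
    # Hypothetical genomes and paths, for illustration only.
    buffer = io.StringIO()
    write_external_genomes(
        [("Genome_01", "dbs/Genome_01.db"), ("Genome_02", "dbs/Genome_02.db")],
        buffer,
    )
    print(buffer.getvalue(), end="")
```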

FUNCTIONS: Tell anvi'o which functional annotation source you like above all, and other important details you like about your analysis.

  --annotation-source SOURCE NAME
                        Get functional annotations for a specific annotation
                        source. You can use the flag '--list-annotation-
                        sources' to learn about what sources are available.
                        (default: None)
  --aggregate-based-on-accession
                        This is important. When anvi'o aggregates functions
                        for functional enrichment analyses or to display them,
                        it uses by default the 'function text' as keys. This
                        is because multiple accession IDs in various databases
                        may correspond to the same function, and when you are
                        doing a functional enrichment analysis, you most
                        likely would like to avoid over-splitting of functions
                        due to this. But then how can we know if you are doing
                        something that requires things to be aggregated based
                        on accession ids for functions rather than actual
                        functions? We can't. But we have this flag here so you
                        can instruct anvi'o to listen to you and not to us.
                        (default: False)
  --aggregate-using-all-hits
                        This program will aggregate functions based on best
                        hits only, and this flag will change that behavior. In
                        some cases a gene may be annotated with multiple
                        functions. This is a decision often made at the level
                        of function annotation tool. For instance, when you
                        run `anvi-run-ncbi-cogs`, you may end up having two
                        COG annotations for a single gene because the gene hit
                        both of them with significance scores that were above
                        the default noise cutoff. While this can be useful
                        when one visualizes functions or works with an `anvi-
                        summarize` output where things should be most
                        comprehensive, having some genes annotated with
                        multiple functions and others with one function may
                        over-split them (since in this scenario a gene with
                        COGXXX and COGXXX;COGYYY would end up in different
                        bins). Thus, when working on functional enrichment
                        analyses or displaying functions anvi'o will only use
                        the best hit for any gene that has multiple hits by
                        default. But you can turn that behavior off explicitly
                        and show anvi'o who is the boss by using this flag.
                        (default: False)
  --min-occurrence NUM GENOMES
                        The minimum number of occurrences of any given function
                        across genomes. If you set a value, those functions
                        that occur in fewer genomes will be excluded.
                        (default: 1)

GROUPS: How should anvi'o divide your genomes into groups?

  -G TEXT_FILE, --groups-txt TEXT_FILE
                        A tab-delimited text file specifying which group each
                        item belongs to. Depending on the context, items here
                        may be individual samples or genomes. The first column
                        must contain item names matching those that are in
                        your input data. A different column should have the
                        header 'group' and contain the group name for each
                        item. Each item should be associated with a single
                        group. It is always a good idea to define groups using
                        single words without any fancy characters. For
                        instance, `HIGH_TEMPERATURE` or `LOW_FITNESS` are good
                        group names. `my group #1` or `IS-THIS-OK?`, are not
                        good group names. (default: None)
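
The advice on group names above can be turned into a quick pre-flight check. The Python sketch below treats "single words without any fancy characters" as letters, digits, and underscores only. That pattern is an assumption made for this example, not a rule anvi'o is known to enforce, and the function name is hypothetical.

```python
import re

# Assumption: a "simple" name is one or more letters, digits, or underscores.
_SIMPLE_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_simple_group_name(name: str) -> bool:
    """True if the name is a single word without spaces or punctuation."""
    return _SIMPLE_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for name in ("HIGH_TEMPERATURE", "LOW_FITNESS", "my group #1", "IS-THIS-OK?"):
        print(repr(name), is_simple_group_name(name))
```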

OUTPUT: A.k.a., what you're really here for

  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)

anvi-script-gen-genomes-file

Generate an external genomes or internal genomes file

Usage

usage: anvi-script-gen-genomes-file [-h] [--input-dir DIR_PATH]
                                    [--include-subdirs] [-c CONTIGS_DB]
                                    [-p PROFILE_DB] [-C COLLECTION_NAME]
                                    [-o FILE_PATH]

Parameters

EXTERNAL GENOMES: Provide a directory, and anvi'o will provide an external genomes file containing all contigs dbs in that directory.

  --input-dir DIR_PATH  Directory path for input files (default: None)
  --include-subdirs     Also search subdirectories for files. (default: False)

INTERNAL GENOMES: Provide a contigs db, profile db, and collection name and anvi'o will bestow upon you an internal genomes file for that collection.

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name. (default: None)

OUTPUT: Path for your internal or external genomes file

  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)

anvi-script-gen-help-pages

Generate a static web site for anvi'o help pages

Usage

usage: anvi-script-gen-help-pages [-h] [-o DIR_PATH]

Parameters

optional arguments:

  -o DIR_PATH, --output-dir DIR_PATH
                        Directory path for output files (default: None)

anvi-script-gen-hmm-hits-matrix-across-genomes

A simple script to generate a TAB-delimited file that reports the frequency of HMM hits for a given HMM source across contigs databases

Usage

usage: anvi-script-gen-hmm-hits-matrix-across-genomes [-h] [-e FILE_PATH]
                                                      [-i FILE_PATH]
                                                      [--hmm-source SOURCE NAME]
                                                      [-l] -o FILE_PATH

Parameters

INPUT: INTERNAL/EXTERNAL GENOMES FILE: Yes. You need to use an internal and/or external genomes file to tell anvi'o where your contigs databases are.

  -e FILE_PATH, --external-genomes FILE_PATH
                        A two-column TAB-delimited flat text file that lists
                        anvi'o contigs databases. The first item in the header
                        line should read 'name', and the second should read
                        'contigs_db_path'. Each line in the file should
                        describe a single entry, where the first column is the
                        name of the genome (or MAG), and the second column is
                        the anvi'o contigs database generated for this genome.
                        (default: None)
  -i FILE_PATH, --internal-genomes FILE_PATH
                        A five-column TAB-delimited flat text file. The header
                        line must contain these columns: 'name', 'bin_id',
                        'collection_id', 'profile_db_path', 'contigs_db_path'.
                        Each line should list a single entry, where 'name' can
                        be any name to describe the anvi'o bin identified as
                        'bin_id' that is stored in a collection. (default:
                        None)

HMM STUFF: This is where you can specify an HMM source, and/or a list of genes to filter your results.

  --hmm-source SOURCE NAME
                        Use a specific HMM source. You can use '--list-hmm-
                        sources' flag to see a list of available resources.
                        The default is 'None'.
  -l, --list-hmm-sources
                        List available HMM sources in the contigs database and
                        quit. (default: False)

OUTPUTTAH:

  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)

anvi-script-gen-programs-network

Generate a network of anvi'o programs

Usage

usage: anvi-script-gen-programs-network [-h] [-o FILE_PATH]
                                        [-p PROGRAM_NAMES_TO_FOCUS]

Parameters

optional arguments:

  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: NETWORK.json)
  -p PROGRAM_NAMES_TO_FOCUS, --program-names-to-focus PROGRAM_NAMES_TO_FOCUS
                        Comma-separated list of program names to focus on.
                        Mostly for debugging purposes. (default: None)

anvi-script-gen-programs-vignette

Generate a markdown summary (vignette) of anvi'o programs

Usage

usage: anvi-script-gen-programs-vignette [-h] [-o FILE_PATH]
                                         [-p PROGRAM_NAMES_TO_FOCUS]

Parameters

optional arguments:

  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: vignette-out.md)
  -p PROGRAM_NAMES_TO_FOCUS, --program-names-to-focus PROGRAM_NAMES_TO_FOCUS
                        Comma-separated list of program names to focus on.
                        Mostly for debugging purposes. (default: None)

anvi-script-gen-pseudo-paired-reads-from-fastq

A script that takes a FASTQ file that is not paired-end (i.e., R1 alone) and converts it into two FASTQ files that are paired-end (i.e., R1 and R2). This is a quick-and-dirty workaround that halves each read from the original FASTQ and puts the first half in the FASTQ file for R1 and the reverse-complement of the second half in the FASTQ file for R2. If you've ended up here, things have clearly not gone very well for you, and Evan, who battled similar battles and ended up implementing this solution, wholeheartedly sympathizes

Usage

usage: anvi-script-gen-pseudo-paired-reads-from-fastq [-h] -f FASTQ
                                                      [-O FILENAME_PREFIX]

Parameters

optional arguments:

  -f FASTQ, --fastq FASTQ
  -O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
                        If you want final FASTQs with the format
                        myfastq_1.fastq and myfastq_2.fastq, then this
                        parameter should be set to myfastq (default: None)
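
The read-halving idea described above can be sketched in a few lines of Python. This is an illustration rather than the script's actual code: the function names are hypothetical, and how quality strings and odd-length reads are handled here are assumptions.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (N stays N)."""
    complement = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(complement)[::-1]


def split_read(seq, qual):
    """Split one read into a pseudo R1/R2 pair.

    R1 is the first half of the read. R2 is the reverse complement of the
    second half, with its quality string reversed so it stays aligned.
    """
    # Assumption: for an odd-length read, the extra base goes to R2.
    half = len(seq) // 2
    r1 = (seq[:half], qual[:half])
    r2 = (reverse_complement(seq[half:]), qual[half:][::-1])
    return r1, r2


r1, r2 = split_read("AAAACCGT", "IIIIHHGF")
print(r1)  # ('AAAA', 'IIII')
print(r2)  # ('ACGG', 'FGHH')
```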

anvi-script-gen-scg-domain-classifier

Train a classifier for SCG domain prediction

Usage

usage: anvi-script-gen-scg-domain-classifier [-h] [--genomes-dir GENOMES_DIR]
                                             [-o PATH]

Parameters

optional arguments:

  --genomes-dir GENOMES_DIR
                        This should be a directory that contains a directory
                        per domain for single-copy core gene collections a
                        given version of anvi'o knows about. For instance, if
                        there are collections for archaea, bacteria, and
                        eukarya, then this directory should contain
                        subdirectories with these names, the contents of which
                        should be contigs databases that belong to those
                        domains. These genomes will be used to generate the
                        classifier. (default: None)
  -o PATH, --output PATH
                        Output file name for the classifier. (default:
                        domain.classifier)

anvi-script-gen-short-reads

Generate short reads from contigs. Useful to reconstruct mock data sets from already assembled contigs

Usage

usage: anvi-script-gen-short-reads [-h] [--output-file-path FASTA_FILE]
                                   CONFIG_FILE

Parameters

positional arguments:

  CONFIG_FILE           Configuration file

optional arguments:

  -h, --help            show this help message and exit
  --output-file-path FASTA_FILE
                        Output FASTA file path (default: None)

anvi-script-gen_stats_for_single_copy_genes.sh

Usage

usage: anvi-gen-contigs-database [-h] -f FASTA [-n PROJECT_NAME]
                                 [-T NUM_THREADS] [-o DB_FILE_PATH]
                                 [--db-variant VARIANT]
                                 [--description TEXT_FILE] [-L INT]
                                 [--skip-mindful-splitting] [-K INT]
                                 [--skip-gene-calling]
                                 [--prodigal-translation-table INT]
                                 [--external-gene-calls GENE-CALLS]
                                 [--ignore-internal-stop-codons]
                                 [--skip-predict-frame]
anvi-gen-contigs-database: error: argument -f/--contigs-fasta: expected one argument

Parameters

anvi-script-get-coverage-from-bam

Get nucleotide-level, contig-level, or bin-level coverage values from a BAM file very rapidly. For other anvi'o programs that are designed to profile BAM files, see anvi-profile and anvi-profile-blitz

Usage

usage: anvi-script-get-coverage-from-bam [-h] -b BAM_FILE [-c CONTIG_NAME]
                                         [-l CONTIGS_OF_INTEREST]
                                         [-C COLLECTION_TXT] -m
                                         {pos,contig,bin} -o OUTPUT
                                         [--skip-contigs-check]

Parameters

REQUIRED: Declare your BAM file here

  -b BAM_FILE, --bam-file BAM_FILE
                        Sorted and indexed BAM file to analyze. (default:
                        None)

OPTION #1: This is the first and simplest option. Provide a contig name

  -c CONTIG_NAME, --contig-name CONTIG_NAME
                        The name of a single contig (default: None)

OPTION #2: Use this to characterize coverage for a list of contigs

  -l CONTIGS_OF_INTEREST, --contigs-of-interest CONTIGS_OF_INTEREST
                        Provide here a file where each line is a contig name.
                        (default: None)

OPTION #3: Use this to characterize coverage for a collection of contig sets (bins)

  -C COLLECTION_TXT, --collection-txt COLLECTION_TXT
                        Provide a collection text file. The first column
                        should be contig names and the second column should be
                        the bin to which the contig belongs. If you have a
                        collection from a profile database, you can export it
                        in this format with anvi-export-collection. (default:
                        None)

METHOD: Do you want to report coverage at a nucleotide level? Contig averages? Bin averages? Pick the method here.

  -m {pos,contig,bin}, --method {pos,contig,bin}
                        If pos, each nucleotide position will be reported
                        (valid for OPTION #1, #2, #3). If contig, report
                        contains contig averages (valid for OPTION #2, #3). If
                        bin, report contains bin averages (valid for OPTION
                        #3). (default: None)
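
The method and input option combinations listed above can be summarized as a small lookup table. The sketch below is only an illustration of those rules; the keys reuse the long flag names from this help text, and the function name is hypothetical.

```python
# Which input options each reporting method accepts, per the help text above.
VALID_INPUTS = {
    "pos": {"contig-name", "contigs-of-interest", "collection-txt"},
    "contig": {"contigs-of-interest", "collection-txt"},
    "bin": {"collection-txt"},
}


def is_valid_combination(method, input_option):
    """Return True if `method` can be used with the given input option."""
    if method not in VALID_INPUTS:
        raise ValueError(f"unknown method: {method!r}")
    return input_option in VALID_INPUTS[method]


print(is_valid_combination("pos", "contig-name"))          # True
print(is_valid_combination("bin", "contigs-of-interest"))  # False
```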

OUTPUT: Your output file is decided here. Keep in mind if you use --method pos, this file will contain as many lines as there are nucleotides defined by your input option

  -o OUTPUT, --output OUTPUT
                        Output tab-delimited file path. Will overwrite
                        existing files. (default: None)

EXTRAS: All the misfits

  --skip-contigs-check  Checking to see that your collection text or contigs
                        of interest file has correct names can take a really
                        long time if you have a large enough number of
                        contigs. Use this flag to forego checking, and find
                        out the hard way. (default: False)

anvi-script-get-hmm-hits-per-gene-call

A simple script to generate a TAB-delimited file of gene caller IDs and their HMM hits for a given HMM source

Usage

usage: anvi-script-get-hmm-hits-per-gene-call [-h] -c CONTIGS_DB
                                              [--hmm-source SOURCE NAME] -o
                                              FILE_PATH

Parameters

INPUT: ANVI'O CONTIGS DB:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)

INPUT: HMM SOURCE:

  --hmm-source SOURCE NAME
                        Use a specific HMM source. You can use '--list-hmm-
                        sources' flag to see a list of available resources.
                        The default is 'None'.

OUTPUTTAH:

  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)

anvi-script-get-primer-matches

You provide this program with FASTQ files for one or more samples AND one or more short sequences, and it collects reads from FASTQ files that match your sequences. This tool can be most powerful if you want to collect all short reads from one or more metagenomes that are downstream of a known sequence. Using the comprehensive output files you can analyze the diversity of sequences visually, manually, or using established strategies such as oligotyping.

Usage

usage: anvi-script-get-primer-matches [-h] [--samples-txt FILE]
                                      --primer-sequences FILE [-m INT]
                                      [--report-raw] [--stop-after INT]
                                      [-o DIR_PATH]

Parameters

INPUT FILES: Here you are expected to declare your FASTQ files and the sequences you want to find in those FASTQ files. Each file should have at least one entry

  --samples-txt FILE    A TAB-delimited file with columns ['sample', 'r1',
                        'r2'] or ['sample', 'group', 'r1', 'r2'] where `r1`
                        and `r2` columns are paths to compressed or flat FASTQ
                        files for each `sample` and `group` is an optional
                        column for relevant applications where samples are
                        affiliated with one-word categorical variables that
                        define to which group they are assigned. (default:
                        None)
  --primer-sequences FILE
                        A single-column file that contains one or more short
                        sequences to be searched. (default: None)

PARAMETERS OF LIFE AND DEATH: Here you are expected to set appropriate parameters for your search (or you can choose to go with the defaults)

  -m INT, --min-remainder-length INT
                        Minimum length of the remainder of the read after a
                        match. If your short read is XXXMMMMMMYYYYYYYYYYYYYY,
                        where Ms indicate the primer sequence, min remainder
                        length is equal to the number of nucleotides marked
                        Y. Default is 60.
  --report-raw          Just report them sequences. Don't bother trimming.
                        (default: False)
  --stop-after INT      Stop after X number of hits because who needs data.
                        (default: 0)
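
To make the `XXXMMMMMMYYYYYYYYYYYYYY` example concrete, here is a minimal Python sketch of the remainder-length check. It assumes exact primer matching and returns everything downstream of the match; the function name is hypothetical and the program's own matching and trimming logic may differ.

```python
def remainder_after_primer(read, primer, min_remainder_length=60):
    """Return the part of `read` downstream of `primer`, or None.

    None is returned when the primer is absent, or when fewer than
    `min_remainder_length` nucleotides follow the match.
    """
    start = read.find(primer)  # exact match only, a simplifying assumption
    if start == -1:
        return None
    remainder = read[start + len(primer):]
    if len(remainder) < min_remainder_length:
        return None
    return remainder


# The toy read from the help text: 3 X, 6 M (the primer), 14 Y.
read = "XXX" + "MMMMMM" + "Y" * 14
print(remainder_after_primer(read, "MMMMMM", min_remainder_length=10))  # YYYYYYYYYYYYYY
print(remainder_after_primer(read, "MMMMMM", min_remainder_length=60))  # None
```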

OUTPUT: Tell anvi'o where to put your thingies

  -o DIR_PATH, --output-dir DIR_PATH
                        Directory path for output files (default: None)

anvi-script-merge-collections

Generate an additional data file from multiple collections

Usage

usage: anvi-script-merge-collections [-h] -c CONTIGS_DB -i FILE(S)
                                     [FILE(S) ...] -o OUTPUT_FILE

Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -i FILE(S) [FILE(S) ...], --input-files FILE(S) [FILE(S) ...]
                        Input file(s). TAB-delimited input files should have
                        two columns, where the first column holds the contig
                        name, and the second one the bin id. This is the
                        standard output of the program anvi-export-collection.
                        (default: None)
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        Output file name. (default: None)
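
As a rough sketch of what merging several two-column collection files could look like, the Python below builds one row per contig with one column per collection. The output layout, the function names, and the collection names are assumptions made for illustration; the program's real output may be organized differently.

```python
def parse_collection(text):
    """Parse two-column TAB-delimited text (contig name, bin id) into a dict."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        contig, bin_id = line.split("\t")
        mapping[contig] = bin_id
    return mapping


def merge_collections(collections):
    """Merge {collection_name: {contig: bin_id}} into one row per contig.

    Contigs missing from a collection get an empty string in that column.
    """
    names = sorted(collections)
    contigs = sorted({c for mapping in collections.values() for c in mapping})
    header = ["contig"] + names
    rows = [[c] + [collections[n].get(c, "") for n in names] for c in contigs]
    return header, rows


# Hypothetical collections, for illustration only.
collection_a = parse_collection("contig_1\tBin_1\ncontig_2\tBin_2\n")
collection_b = parse_collection("contig_1\tBin_A\ncontig_3\tBin_B\n")

header, rows = merge_collections({"collection_a": collection_a,
                                  "collection_b": collection_b})
print("\t".join(header))
for row in rows:
    print("\t".join(row))
```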

anvi-script-permute-trnaseq-seeds

This script generates a FASTA file of tRNA-seq seeds with permuted nucleotides at positions of predicted modification-induced substitutions. The underlying nucleotide without modification is not always the most common base call. The resulting FASTA file can be queried against a database of tRNA genes to validate nucleotides at modified positions and find the most similar sequences.

Usage

usage: anvi-script-permute-trnaseq-seeds [-h] -c CONTIGS_DB
                                         --specific-profile-db PROFILE_DB -f
                                         FASTA [-n FLOAT] [-x INT] [-W]

Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  --specific-profile-db PROFILE_DB, -s PROFILE_DB
                        The path to an anvi'o profile database containing
                        specific coverage information on tRNA seeds. `anvi-
                        merge-trnaseq` generates a specific profile database
                        from a tRNA-seq experiment. (default: None)
  -f FASTA, --contigs-fasta FASTA
                        FASTA file to generate (default: None)
  -n FLOAT, --min-nt-frequency FLOAT
                        For a position in a contig, this is the minimum
                        nucleotide frequency, summed across all samples,
                        required for the nucleotide to be substituted in
                        permuted sequences. (default: 0.05)
  -x INT, --max-variable-positions INT
                        The maximum number of modified positions that can be
                        permuted at once in a contig. (default: 5)
  -W, --overwrite-output-destinations
                        Overwrite if the output files and/or directories
                        exist. (default: False)
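
One way to picture the permutation step is sketched below in Python. It reads "permuted at once" as "substitute up to `max_variable_positions` positions in the same sequence", which is an assumption, as are the function name and the input layout (a dict from 0-based position to alternative nucleotides that already passed the frequency filter).

```python
from itertools import combinations, product


def permute_seed(seed, alternatives, max_variable_positions=5):
    """Return sorted, unique variants of `seed` with substituted positions.

    `alternatives` maps a 0-based position to the nucleotides that may
    replace the seed nucleotide there.
    """
    variants = set()
    positions = sorted(alternatives)
    limit = min(max_variable_positions, len(positions))
    for k in range(1, limit + 1):
        for chosen in combinations(positions, k):
            # Every choice of replacement nucleotides at the chosen positions
            for nts in product(*(alternatives[p] for p in chosen)):
                seq = list(seed)
                for pos, nt in zip(chosen, nts):
                    seq[pos] = nt
                variants.add("".join(seq))
    variants.discard(seed)  # the unmodified seed is not a permutation
    return sorted(variants)


print(permute_seed("ACGT", {1: "GT", 3: "A"}, max_variable_positions=1))
print(permute_seed("ACGT", {1: "GT", 3: "A"}, max_variable_positions=2))
```

The number of variants grows quickly with the number of positions, which is presumably why a cap such as `--max-variable-positions` is useful.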

anvi-script-pfam-accessions-to-hmms-directory

You give this program one or more PFAM accession ids, and it generates an anvi'o compatible HMM directory to be used with anvi-run-hmms

Usage

usage: anvi-script-pfam-accessions-to-hmms-directory [-h]
                                                     [--pfam-accessions-list PFAM_ACCESSION [PFAM_ACCESSION ...]]
                                                     [--pfam-accessions-file FILE]
                                                     [-O PATH]

Parameters

optional arguments:

  --pfam-accessions-list PFAM_ACCESSION [PFAM_ACCESSION ...]
                        One or more PFAM accession IDs (such as PF14437.6). If
                        you have multiple accessions, you can separate them
                        from each other with a space. If you have too many,
                        consider using the `--pfam-accessions-file` parameter
                        instead. (default: None)
  --pfam-accessions-file FILE
                        A single column text file where each line is a
                        single PFAM accession ID (such as PF14437.6). You may
                        have as many accession ids as you like in this file.
                        (default: None)
  -O PATH, --output-directory PATH
                        Output directory for the anvi'o-formatted HMMs. Choose
                        the name wisely as this will be the name that will
                        appear in the contigs database after you provide it
                        with `-H` flag to `anvi-run-hmms`. We suggest you use
                        a name that does not include any of those millennial
                        characters (like space, question mark, comma, and
                        such, you know). (default: None)

anvi-script-predict-CPR-genomes

Screen for genomes to find likely members of CPR

Usage

usage: anvi-script-predict-CPR-genomes [-h] -c CONTIGS_DB [-p PROFILE_DB]
                                       [-C COLLECTION_NAME]
                                       [--list-collections]
                                       [--report-only-cpr]
                                       [--min-genome-size MIN_GENOME_SIZE]
                                       [--min-percent-completion MIN_PERCENT_COMPLETION]
                                       [--max-percent-redundancy MAX_PERCENT_REDUNDANCY]
                                       [--min-class-probability MIN_CLASS_PROBABILITY]
                                       [-o FILE_PATH] [--just-do-it]
                                       CLASSIFIER_FILE

Parameters

positional arguments:

  CLASSIFIER_FILE       Model output generated by anvi-script-gen-CPR-
                        classifier

optional arguments:

  -h, --help            show this help message and exit
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database (default: None)
  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name. (default: None)
  --list-collections    Show available collections and exit. (default: False)
  --report-only-cpr     Include only bins that look like CPR genomes.
                        (default: False)
  --min-genome-size MIN_GENOME_SIZE
                        Minimum genome size to consider for CPR in Mbp.
                        Default is 0.500000
  --min-percent-completion MIN_PERCENT_COMPLETION
                        Minimum percent completion estimate based on anvi'o
                        default single-copy gene collections. Default is 50
  --max-percent-redundancy MAX_PERCENT_REDUNDANCY
                        Maximum percent redundancy of single-copy genes in an
                        anvi'o bin, or a genome to consider for
                        classification. The default is 30
  --min-class-probability MIN_CLASS_PROBABILITY
                        If the classification confidence is below this don't
                        bother. Default is 75.
  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)
  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)

anvi-script-process-genbank

This script takes a GenBank file, and outputs a FASTA file, as well as two additional TAB-delimited output files for external gene calls and gene functions that can be used with the programs anvi-gen-contigs-database and anvi-import-functions

Usage

usage: anvi-script-process-genbank [-h] -i GENBANK [-O FILENAME_PREFIX]
                                   [--output-fasta FASTA]
                                   [--output-gene-calls TAB DELIMITED FILE]
                                   [--output-functions TAB DELIMITED FILE]
                                   [--annotation-source ANNOTATION_SOURCE]
                                   [--annotation-version ANNOTATION_VERSION]
                                   [--omit-aa-sequences-column]

Parameters

INPUT: Give us the preciousss…

  -i GENBANK, --input-genbank GENBANK
                        Input GenBank file (default: None)

OUTPUT: You either provide a 'prefix', or provide specific output file names/paths. You can't mix the two (well, you can try).

  -O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX
                        A prefix to be used while naming the output files (no
                        file type extensions please; just a prefix). (default:
                        None)
  --output-fasta FASTA  Output FASTA file path. (default: None)
  --output-gene-calls TAB DELIMITED FILE
                        Output file path for external gene calls (default:
                        None)
  --output-functions TAB DELIMITED FILE
                        Output file path for anvi'o-importable gene functions
                        file (default: None)

DETAILS: Setting the annotation source and version data to appear in the output file for functional annotations file.

  --annotation-source ANNOTATION_SOURCE
                        Annotation source (default: "NCBI_PGAP") (default:
                        NCBI_PGAP)
  --annotation-version ANNOTATION_VERSION
                        Annotation source version to be stored in the database
                        (default: "v4.6") (default: v4.6)
  --omit-aa-sequences-column
                        If amino acid sequences for gene calls are present in
                        the GFF file, anvi'o by default will include an
                        `aa_sequences` column in the external gene calls
                        output file. This is good because then anvi-gen-
                        contigs-database can use that information directly,
                        instead of trying to predict the translated sequences
                        itself. If you would like this program to not report
                        amino acid sequences even when they are present and use
                        anvi'o to predict translated sequences later, you can
                        use this flag. (default: False)

anvi-script-process-genbank-metadata

This script takes the 'metadata' output of the program ncbi-genome-download (see https://github.com/kblin/ncbi-genome-download for details), and processes each GenBank file found in the metadata file to generate a FASTA file, as well as genes and functions files for each entry. Plus, it automatically generates a FASTA TXT file descriptor for anvi'o snakemake workflows. So it is a multi-talented program like that

Usage

usage: anvi-script-process-genbank-metadata [-h] -m GENBANK_METADATA
                                            [-o DIR_PATH]
                                            [--output-fasta-txt OUTPUT_FASTA_TXT]
                                            [-E]

Parameters

INPUT: Give us the preciousss…

  -m GENBANK_METADATA, --metadata GENBANK_METADATA
                        This is the file you get from the program `ncbi-
                        genome-download` when you use the parameter
                        `--metadata-table`. (default: None)

OUTPUT: Where to find your precioussesss…

  -o DIR_PATH, --output-dir DIR_PATH
                        Directory path for output files (default: None)
  --output-fasta-txt OUTPUT_FASTA_TXT
                        This is not a FASTA file, but a TAB-delimited file
                        with all the file names and paths processed by this
                        program. This output can directly go into the anvi'o
                        snakemake workflows because magic. (default: None)

ADDITIONAL PARAMETERS: Additional things you can set.

  -E, --exclude-gene-calls-from-fasta-txt
                        This flag will exclude the external gene calls and
                        functions from the fasta.txt file. Files for external
                        gene calls and functions will still be generated
                        according to the information stored in the GenBank
                        file, but they will simply not be included in your
                        fasta.txt file. By doing so you will *guarantee* that
                        when you use this file from within a workflow, anvi'o
                        will use its default gene caller to identify genes.
                        (default: False)

anvi-script-reformat-fasta

Reformat FASTA file (remove contigs based on length, or based on a given list of deflines, and/or generate an output with simpler names)

Usage

usage: anvi-script-reformat-fasta [-h] [-l MIN_LENGTH]
                                  [--max-percentage-gaps PERCENTAGE]
                                  [-M MAX_GAPS] [-i TXT FILE]
                                  [--export-gap-counts-table TSV FILE]
                                  [-I TXT FILE] -o FASTA FILE
                                  [--simplify-names] [--prefix PREFIX]
                                  [--seq-type SEQ TYPE] [-r REPORT FILE]
                                  FASTA FILE

Parameters

positional arguments:

  FASTA FILE

optional arguments:

  -h, --help            show this help message and exit
  -l MIN_LENGTH, --min-len MIN_LENGTH
                        Minimum length of contigs to keep (contigs shorter
                        than this value will not be included in the output
                        file). The default is 0, so nothing will be removed if
                        you do not declare a minimum size.
  --max-percentage-gaps PERCENTAGE
                        Maximum percentage of gaps in a sequence (any sequence
                        with more gaps will be removed from the output FASTA
                        file). The default is 100.000000.
  -M MAX_GAPS, --max-gaps MAX_GAPS
                        Maximum number of gaps allowed per sequence in the
                        alignment. Don't know which threshold to pick? Use
                        --export-gap-counts-table to explore the gap counts
                        per sequence distribution! (default: 1000000)
  -i TXT FILE, --exclude-ids TXT FILE
                        IDs to remove from the FASTA file. You cannot provide
                        both --keep-ids and --exclude-ids. (default: None)
  --export-gap-counts-table TSV FILE
                        Export a table with the number of gaps per sequence.
                        Please provide a prefix to name the file. (default:
                        None)
  -I TXT FILE, --keep-ids TXT FILE
                        If provided, all IDs not in this file will be excluded
                        from the reformatted FASTA file. Any additional
                        filters (such as --min-len) will still be applied to
                        the IDs in this file. You cannot provide both
                        --exclude-ids and --keep-ids. (default: None)
  -o FASTA FILE, --output-file FASTA FILE
                        Output file path. (default: None)
  --simplify-names      Edit deflines to make sure contigs have simple
                        names. (default: False)
  --prefix PREFIX       Use this parameter if you would like to add a prefix
                        to your contig names while simplifying them. The
                        prefix must be a single word (you can use underscore
                        character, but nothing more!). (default: None)
  --seq-type SEQ TYPE   Supply either 'NT' or 'AA' (if you want). If 'NT', any
                        characters besides {A,C,T,G} will be replaced with
                        'N'. If 'AA', any characters that are not 1-letter
                        amino acid characters will be replaced with 'X'. If
                        you don't supply anything, no characters will be
                        modified. (default: None)
  -r REPORT FILE, --report-file REPORT FILE
                        Report file path. When you run this program with
                        `--simplify-names` flag, all changes to deflines will
                        be reported in this file in case you need to go back
                        to this information later. It is not mandatory to
                        declare one, but it is a very good idea to have it.
                        (default: None)
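
The filters described above can be pictured with a short Python sketch that works on (defline, sequence) pairs. It only covers `--min-len`, `--simplify-names`, `--prefix`, and `--seq-type NT`. The function name and the simplified naming scheme (a zero-padded counter) are hypothetical, and the nucleotide check here is case-sensitive, which is an assumption.

```python
def reformat_records(records, min_len=0, simplify_names=False,
                     prefix=None, seq_type=None):
    """Filter and clean (defline, sequence) pairs.

    Returns the kept records and a report of (new_name, old_name) pairs.
    """
    kept, report = [], []
    for defline, seq in records:
        if len(seq) < min_len:
            continue  # shorter than the minimum length: not included
        if seq_type == "NT":
            # Any character besides A, C, T, G is replaced with N
            seq = "".join(base if base in "ACTG" else "N" for base in seq)
        name = defline
        if simplify_names:
            # Hypothetical naming scheme: zero-padded counter, optional prefix
            counter = f"{len(kept) + 1:012d}"
            name = f"{prefix}_{counter}" if prefix else f"c_{counter}"
            report.append((name, defline))
        kept.append((name, seq))
    return kept, report


# Made-up records, for illustration only.
records = [
    ("contig one | len=8", "ACGTRYAC"),
    ("short", "AC"),
    ("contig three", "GGGGCCCC"),
]
kept, report = reformat_records(records, min_len=4, simplify_names=True,
                                prefix="sample", seq_type="NT")
for name, seq in kept:
    print(f">{name}\n{seq}")
```

The `report` list plays the role of the report file: it keeps the mapping from each new name back to the original defline.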

anvi-script-run-eggnog-mapper

Run eggnog-mapper on a contigs database, and store results

Usage

usage: anvi-script-run-eggnog-mapper [-h] -c CONTIGS_DB
                                     [--cog-data-dir COG_DATA_DIR]
                                     [-T NUM_THREADS]
                                     [--drop-previous-annotations]
                                     [--annotation EMAPPER_ANNOTATION_FILE]
                                     [--use-version EMAPPER_VERSION]

Parameters

optional arguments:

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs-database' (default: None)
  --cog-data-dir COG_DATA_DIR
                        The directory path for your COG setup if you did not
                        use the default directory. (default: None)
  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading
                        whenever possible. Very conservatively, the default is
                        1. It is a good idea to not exceed the number of CPUs
                        / cores on your system. Plus, please be careful with
                        this option if you are running your commands on a SGE
                        --if you are clusterizing your runs, and asking for
                        multiple threads to use, you may deplete your
                        resources very fast. (default: None)
  --drop-previous-annotations
                        When declared, previous annotations in the database
                        will be dropped. (default: False)
  --annotation EMAPPER_ANNOTATION_FILE
                        If you have an annotation file from a previous run,
                        you can call this program to import the contents of
                        that file into the database instead of a run from
                        scratch. In that case, you must also use the `--use-
                        version` parameter to clarify which parser version
                        should be used to parse it. (default: None)
  --use-version EMAPPER_VERSION
                        The version of eggnog-mapper that generated the
                        annotation file. (default: 0.12.6)

anvi-script-snvs-to-interactive

Take the output of anvi-gen-variability-profile, prepare an output for interactive interface

Usage

usage: anvi-script-snvs-to-interactive [-h]
                                       [--min-departure-from-consensus FLOAT]
                                       [--max-departure-from-consensus FLOAT]
                                       [--min-departure-from-reference FLOAT]
                                       [--max-departure-from-reference FLOAT]
                                       [--display-dep-from-reference]
                                       [--only-in-genes] [--random INTEGER]
                                       [--just-do-it] -o DIR_PATH
                                       VARIABILITY_PROFILE

Parameters

positional arguments:

  VARIABILITY_PROFILE   The output file generated by anvi-gen-variability-
                        profile

optional arguments:

  -h, --help            show this help message and exit
  --min-departure-from-consensus FLOAT
                        Minimum departure from consensus at a given variable
                        nucleotide position. The default is 0.00.
  --max-departure-from-consensus FLOAT
                        Maximum departure from consensus at a given variable
                        nucleotide position. The default is 1.00.
  --min-departure-from-reference FLOAT
                        Minimum departure from reference at a given variable
                        nucleotide position. The default is 0.00.
  --max-departure-from-reference FLOAT
                        Maximum departure from reference at a given variable
                        nucleotide position. The default is 1.00.
  --display-dep-from-reference
                        By default this program will generate a matrix file
                        that displays departure from consensus values. This
                        flag will switch to departure from reference.
                        (default: False)
  --only-in-genes       With this flag you will ignore SNVs in non-coding
                        regions. (default: False)
  --random INTEGER      Use this parameter to randomly subset your data. If
                        there are too many SNV positions, this script may take
                        forever to finish. You should *never* let it try to
                        deal with more than 25-30K points, but an ideal would
                        be around 4-5 thousand. (default: None)
  --just-do-it          Don't bother me with questions or warnings, just do
                        it. (default: False)
  -o DIR_PATH, --output-dir DIR_PATH
                        Directory path for output files (default: None)

anvi-script-tabulate

Tabulates TAB-delimited data with headers in terminal: cat table.txt | anvi-script-tabulate

Usage

usage: anvi-script-tabulate [-h] [--max-width MAX_WIDTH]

Parameters

optional arguments:

  --max-width MAX_WIDTH
                        Maximum number of characters to be displayed in the
                        output table. The default is 120 to make sure tables
                        will fit to most displays. Set to 0 to see the entire
                        table.

anvi-script-transpose-matrix

Transpose a TAB-delimited file

Usage

usage: anvi-script-transpose-matrix [-h] -o MATRIX_FILE MATRIX_FILE

Parameters

positional arguments:

  MATRIX_FILE           Input matrix.

optional arguments:

  -h, --help            show this help message and exit
  -o MATRIX_FILE, --output-file MATRIX_FILE
                        File path to store results. (default: None)
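
For readers who want to see what transposing a TAB-delimited matrix means in practice, here is a minimal Python sketch. It is an illustration and not the program's implementation; the function name is hypothetical and the matrix is made up.

```python
def transpose_tab_delimited(text):
    """Transpose TAB-delimited text so rows become columns."""
    rows = [line.split("\t") for line in text.splitlines() if line]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all rows must have the same number of columns")
    # zip(*rows) yields one tuple per original column
    return "\n".join("\t".join(column) for column in zip(*rows)) + "\n"


matrix = "sample\tg1\tg2\nS1\t1\t2\nS2\t3\t4\n"
print(transpose_tab_delimited(matrix), end="")
```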

anvi-script-variability-to-vcf

A script to convert SNV output obtained from anvi-gen-variability-profile to the standard VCF format

Usage

usage: anvi-script-variability-to-vcf [-h] [-i FILE_PATH] [-o FILE_PATH]

Parameters

optional arguments:

  -i FILE_PATH, --input FILE_PATH
                        Filepath to the SNV table. This is the output from the
                        anvi-gen-variability-profile program with the
                        nucleotide engine (which is the default engine).
                        (default: None)
  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results. (default: None)