Do cool stuff with gene clusters in anvi'o pan genomes.
This aptly-named program gets the sequences for the gene clusters stored in a pan-db and returns them as either a genes-fasta or a concatenated-gene-alignment-fasta (which you can use to run anvi-gen-phylogenomic-tree). This gives you advanced access to your gene clusters, which you can take out of anvi’o, use for phylogenomic analyses, or do whatever you please with.
While the number of parameters may seem daunting, many of the options just help you specify exactly which gene clusters you want to get the sequences from.
Running on all gene clusters
A basic run exports alignments for every single gene cluster found in the pan-db as amino acid sequences; you provide the pan-db, the genomes-storage-db, and an output path for the genes-fasta.
To get the DNA sequences instead, just add --report-DNA-sequences to the command.
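As a minimal sketch of such a run, the helper below wraps the program in a small shell function. The function name and argument order are hypothetical and only there to keep the three file arguments explicit; the -p, -g, and -o flags are the ones used in the filtered example later on this page, and --report-DNA-sequences is assumed to be the DNA switch (confirm with the program's help menu).

```shell
# Hypothetical helper: the function name and argument order are illustrative.
# $1 = pan-db, $2 = genomes-storage-db, $3 = output FASTA path.
# Any further arguments are passed through to the program unchanged.
export_gene_clusters() {
    [ "$#" -ge 3 ] || {
        echo "usage: export_gene_clusters PAN_DB GENOMES_DB OUT_FASTA [extra flags]" >&2
        return 1
    }
    pan_db=$1
    genomes_db=$2
    out_fasta=$3
    shift 3
    anvi-get-sequences-for-gene-clusters \
        -p "$pan_db" \
        -g "$genomes_db" \
        -o "$out_fasta" \
        "$@"
}

# Amino acid sequences (the default):
#   export_gene_clusters pan-db genomes-storage-db genes-fasta
# DNA sequences instead (flag name assumed, see above):
#   export_gene_clusters pan-db genomes-storage-db genes-fasta --report-DNA-sequences
```

Passing extra flags through unchanged means the same helper also works with the filters described further down.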
Exporting only specific gene clusters
Part 1: Choosing gene clusters by collection, bin, or name
You can export only the sequences for a specific collection or bin with the parameters -C and -b, respectively. You also have the option to display the collections and bins available in your pan-db with --list-collections and --list-bins.
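To make the collection and bin selection concrete, here is a hedged sketch. The function name, the argument order, and the collection and bin names in the usage comment are all hypothetical; -C is assumed to be the collection counterpart of the -b flag mentioned above.

```shell
# Hypothetical helper: exports the gene clusters of one bin in one collection.
# $1 = pan-db, $2 = genomes-storage-db, $3 = output FASTA path,
# $4 = collection name, $5 = bin name.
export_bin() {
    [ "$#" -eq 5 ] || {
        echo "usage: export_bin PAN_DB GENOMES_DB OUT_FASTA COLLECTION BIN" >&2
        return 1
    }
    anvi-get-sequences-for-gene-clusters \
        -p "$1" \
        -g "$2" \
        -o "$3" \
        -C "$4" \
        -b "$5"
}

# Example call (collection and bin names are made up):
#   export_bin pan-db genomes-storage-db genes-fasta my_collection my_bin
```

If you are not sure which names exist in your pan-db, list the collections and bins first rather than guessing.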
Alternatively, you can export specific gene clusters by name, either by providing a single gene cluster ID or a file with one gene cluster ID per line. For example, a file named gene_clusters.txt could contain the following:
GC_00000618
GC_00000643
GC_00000729
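Since the file needs exactly one gene cluster ID per line, it can help to generate and check it from the shell. The sketch below writes the three IDs listed above and flags any line that does not look like one; the GC_ followed by digits pattern is inferred from those IDs and is an assumption about your own data.

```shell
# Write one gene cluster ID per line.
printf '%s\n' GC_00000618 GC_00000643 GC_00000729 > gene_clusters.txt

# Sanity check: warn if any line is not "GC_" followed by digits only.
# grep -v selects non-matching lines; -q makes it report via exit status.
if grep -vqE '^GC_[0-9]+$' gene_clusters.txt; then
    echo "gene_clusters.txt contains unexpected lines" >&2
fi

# The file would then be passed to the program, for example (assuming the
# --gene-cluster-ids-file flag; confirm with the program's help menu):
#   anvi-get-sequences-for-gene-clusters -p pan-db -g genomes-storage-db \
#       -o genes-fasta --gene-cluster-ids-file gene_clusters.txt
```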
Part 2: Choosing gene clusters by their attributes
These parameters exclude gene clusters that don’t reach certain thresholds and are applied on top of any filters already in place (for example, you can use them to exclude clusters within a specific bin).
Here is a list of the different filters that you can use to exclude some subsection of your gene clusters:
- min/max number of genomes that the gene cluster occurs in.
- min/max number of genes from each genome. For example, you could exclude clusters that don’t appear in every genome 3 times, or get single-copy genes by setting --max-num-genes-from-each-genome 1
- min/max geometric homogeneity index
- min/max functional homogeneity index
- min/max combined homogeneity index
For example, the following run on a genomes-storage-db that contains 50 genomes will report only the single-copy core genes with a functional homogeneity index above 0.25:
anvi-get-sequences-for-gene-clusters -g genomes-storage-db \
                                     -p pan-db \
                                     -o genes-fasta \
                                     --max-num-genes-from-each-genome 1 \
                                     --min-num-genomes-gene-cluster-occurs 50 \
                                     --min-functional-homogeneity-index 0.25
You can also exclude genomes that are missing some number of the gene clusters that you’re working with by using the parameter --max-num-gene-clusters-missing-from-genome.
For each of these parameters, see the program’s help menu for more information.
Fun with phylogenomics!
When you export a concatenated-gene-alignment-fasta for phylogenomics, you also have the option to choose a specific aligner (or list the available aligners), as well as ask for a NEXUS-formatted partition file to be reported alongside it, if you so choose.
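A phylogenomic export usually combines the single-copy core filters from the example earlier on this page with concatenation. The sketch below assumes the concatenation switch is --concatenate-gene-clusters (please confirm with the program's help menu); the function name and argument order are hypothetical.

```shell
# Hypothetical helper: exports a concatenated alignment of single-copy core
# gene clusters.
# $1 = pan-db, $2 = genomes-storage-db, $3 = output FASTA path,
# $4 = number of genomes in the pangenome (a cluster must occur in all of them).
# Any further arguments are passed through to the program unchanged.
export_core_alignment() {
    [ "$#" -ge 4 ] || {
        echo "usage: export_core_alignment PAN_DB GENOMES_DB OUT_FASTA NUM_GENOMES [extra flags]" >&2
        return 1
    }
    pan_db=$1
    genomes_db=$2
    out_fasta=$3
    num_genomes=$4
    shift 4
    anvi-get-sequences-for-gene-clusters \
        -p "$pan_db" \
        -g "$genomes_db" \
        -o "$out_fasta" \
        --max-num-genes-from-each-genome 1 \
        --min-num-genomes-gene-cluster-occurs "$num_genomes" \
        --concatenate-gene-clusters \
        "$@"
}

# Example call for a pangenome of 50 genomes (output name is made up):
#   export_core_alignment pan-db genomes-storage-db core-alignment.fa 50
```

Aligner and partition-file options can be appended after the genome count as extra flags; their exact names are best taken from the program's help menu. The resulting file is what you would hand to anvi-gen-phylogenomic-tree.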