A program that computes functional enrichment within a pangenome.
This program computes functional enrichment within a pangenome and returns a functional-enrichment-txt file.
See also anvi-display-functions, which can both calculate functional enrichment and give you an interactive interface to display the distribution of functions.
Enriched functions in a pangenome
For this to run, you must provide a pan-db and genomes-storage-db pair, as well as a misc-data-layers that associates genomes in your pan database with categorical data. The program will then find functions that are enriched in each group (i.e., functions that are associated with gene clusters that are characteristic of the genomes in that group).
Note that your genomes-storage-db must have at least one functional annotation source for this to work.
This analysis will help you identify functions that are associated with a specific group of genomes in a pangenome and determine the functional core of your pangenome. For example, in the Prochlorococcus pangenome (the one used in the pangenomics tutorial, where you can find more info about this program), this program finds that Exonuclease VII is enriched in the low-light genomes and not in the high-light genomes. The output file provides various statistics about how confident the program is in making this association.
How does it work?
What this program does can be broken down into three steps:
Determine groups of genomes. The program uses a misc-data-layers variable (containing categorical, not numerical, data) to split genomes in a pangenome into two or more groups. For example, in the pangenome tutorial, the categorical variable was named light, and it partitioned genomes into low-light and high-light groups.
Determine the “functional associations” of gene clusters. In short, this is collecting the functional annotations for all of the genes in each cluster and assigning the one that appears most frequently to represent the entire cluster.
Quantify the distribution of functions in each group of genomes. For this, the program determines to what extent a particular function is enriched in specific groups of genomes and reports it as a functional-enrichment-txt file. It does so by running the script
Check out Alon’s behind the scenes post, which goes into a lot more detail.
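The steps above can be illustrated on toy data. The following is a minimal sketch, not anvi'o's implementation: the genome names, annotations, function names, the `keep_uncategorized` argument and the `UNGROUPED` label are all hypothetical. It only computes the fraction of genomes per group that carry each function, and does not reproduce the statistics reported in the functional-enrichment-txt file.

```python
from collections import Counter, defaultdict

# Hypothetical toy inputs; names and values are illustrative only.
genome_category = {
    "G1": "LL",
    "G2": "LL",
    "G3": "HL",
    "G4": "HL",
    "G5": None,  # a genome with no category assigned
}

# One (gene cluster, genome, function annotation) tuple per gene.
gene_annotations = [
    ("GC_1", "G1", "Exonuclease VII"),
    ("GC_1", "G2", "Exonuclease VII"),
    ("GC_1", "G2", "Hypothetical protein"),
    ("GC_2", "G1", "DNA polymerase"),
    ("GC_2", "G3", "DNA polymerase"),
    ("GC_2", "G4", "DNA polymerase"),
]


def group_genomes(genome_category, keep_uncategorized=False):
    """Step 1: split genomes into groups by a categorical variable."""
    groups = defaultdict(set)
    for genome, category in genome_category.items():
        if category is None:
            if not keep_uncategorized:
                continue  # uncategorized genomes are skipped by default
            category = "UNGROUPED"  # hypothetical label
        groups[category].add(genome)
    return dict(groups)


def representative_functions(gene_annotations):
    """Step 2: assign each cluster its most frequent annotation."""
    counts = defaultdict(Counter)
    for cluster, _genome, function in gene_annotations:
        counts[cluster][function] += 1
    return {cluster: c.most_common(1)[0][0] for cluster, c in counts.items()}


def fraction_per_group(gene_annotations, groups, cluster_function):
    """Simplified stand-in for step 3: fraction of genomes in each group
    that carry each function (clusters sharing a function are merged)."""
    genomes_with = defaultdict(set)
    for cluster, genome, _function in gene_annotations:
        genomes_with[cluster_function[cluster]].add(genome)
    return {
        function: {
            group: len(genomes & members) / len(members)
            for group, members in groups.items()
        }
        for function, genomes in genomes_with.items()
    }


groups = group_genomes(genome_category)
cluster_function = representative_functions(gene_annotations)
fractions = fraction_per_group(gene_annotations, groups, cluster_function)
print(fractions)
```

In this toy data set the hypothetical function "Exonuclease VII" is carried by every genome in one group and by none in the other, which is the kind of pattern the program is designed to surface.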
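Here is the simplest way to run this program:
anvi-compute-functional-enrichment-in-pan -p pan-db \
    -g genomes-storage-db \
    -o functional-enrichment-txt \
    --category-variable CATEGORY \
    --annotation-source FUNCTION_SOURCE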
The pan-db must contain at least one categorical data layer in misc-data-layers, and you must choose one of these categories to define your pan-groups with the --category-variable parameter. You can see the available variables by running the anvi-show-misc-data program with the parameters -t layers --debug.
Note that by default any genomes not in a category will be ignored; you can instead include these in the analysis by using the flag
The genomes-storage-db must have at least one functional annotation source, and you must choose one of these sources with the --annotation-source parameter. If you do not know which functional annotation sources are available in your genomes-storage-db, you can use the --list-annotation-sources parameter to find out.
By default, gene clusters with the same functional annotation will be merged. But if you provide the
--include-gc-identity-as-function parameter and set the annotation source to be ‘IDENTITY’, anvi’o will treat gene cluster names as functions and enable you to investigate enrichment of each gene cluster independently. This is how you do it:
anvi-compute-functional-enrichment-in-pan -p pan-db \
    -g genomes-storage-db \
    -o functional-enrichment-txt \
    --category-variable CATEGORY \
    --annotation-source IDENTITY \
    --include-gc-identity-as-function
To output a functional occurrence table, which describes the number of times each of your functional associations occurs in each genome you’re looking at, use the
--functional-occurrence-table-output parameter, like so:
anvi-compute-functional-enrichment-in-pan -p pan-db \
    -g genomes-storage-db \
    -o functional-enrichment-txt \
    --category-variable CATEGORY \
    --annotation-source FUNCTION_SOURCE \
    --functional-occurrence-table-output FUNC_OCCURRENCE.TXT
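To make the idea of a functional occurrence table concrete, here is a small sketch that counts how many times each function occurs in each genome. The input pairs, the function names and the tab-separated layout are assumptions for illustration. The actual column layout of the file written by anvi'o is not described here and may differ.

```python
from collections import Counter

# Hypothetical (genome, function) pairs, one per annotated occurrence.
observations = [
    ("G1", "Exonuclease VII"),
    ("G1", "DNA polymerase"),
    ("G2", "Exonuclease VII"),
    ("G2", "Exonuclease VII"),
    ("G3", "DNA polymerase"),
]


def occurrence_table(observations):
    """Count occurrences of each function in each genome (0 if absent)."""
    counts = Counter(observations)
    genomes = sorted({genome for genome, _ in observations})
    functions = sorted({function for _, function in observations})
    return {f: {g: counts[(g, f)] for g in genomes} for f in functions}


def to_tsv(table):
    """Render the table as tab-separated text (assumed layout)."""
    if not table:
        return ""
    genomes = sorted(next(iter(table.values())))
    lines = ["\t".join(["function"] + genomes)]
    for function in sorted(table):
        row = [str(table[function][g]) for g in genomes]
        lines.append("\t".join([function] + row))
    return "\n".join(lines)


table = occurrence_table(observations)
print(to_tsv(table))
```

Rows are functions and columns are genomes in this sketch, so a zero marks a function that was not observed in a given genome.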