Microbial 'omics

a post by Antti Karkman

A note from the Meren Lab: We are very thankful to Antti on behalf of all anvi’o users who are also Prokka users for taking his time to share his experience with importing Prokka annotations into anvi’o.

In this tutorial I will walk you through the steps of annotating assembled contigs with functions using Prokka, and importing these annotations into your anvi’o contigs database using gff_parser.py, a script I implemented to make Prokka outputs compatible with anvi’o.

Setting the stage

This tutorial assumes that you have both anvi’o and Prokka installed and functional on your system, and that you have finished assembling your short reads into contigs and stored them in a file, say, contigs-raw.fa.

The following commands will download a script to parse Prokka annotations later, and install the Python library it relies on. Please run these two commands in your work directory:

 $ wget https://raw.githubusercontent.com/karkman/gff_parser/master/gff_parser.py -O gff_parser.py
 $ pip install gffutils

The remainder of this tutorial largely follows the metagenomic workflow tutorial, except for the functional annotation part.

But we will start from the beginning to make things clearer.

Reformat your contigs.fa

Please remember that you need to simplify your contig names prior to mapping.

Sometimes contig names coming from the assembler cause problems in anvi’o and you need to modify them with anvi-script-reformat-fasta:

 $ anvi-script-reformat-fasta contigs-raw.fa \
                              -o contigs.fa \
                              --min-len 0 \
                              --simplify-names
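
To give a feeling for what simplifying names means, here is a minimal Python sketch that renames FASTA deflines to short sequential names and keeps a map back to the originals. It is not how anvi-script-reformat-fasta is implemented; the function name, the `c_` prefix, and the zero-padded numbering are assumptions made for illustration, and the example deflines are made up.

```python
def simplify_fasta_names(fasta_lines, prefix="c"):
    """Rename FASTA deflines to sequential names (illustrative naming scheme).

    Returns the rewritten lines and a dict mapping new name -> original defline.
    """
    name_map = {}
    simplified = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Hypothetical pattern: prefix, underscore, 12-digit counter.
            new_name = f"{prefix}_{len(name_map) + 1:012d}"
            name_map[new_name] = line[1:]
            simplified.append(">" + new_name)
        elif line:
            simplified.append(line)
    return simplified, name_map


# Made-up assembler-style deflines with spaces and punctuation.
raw = [
    ">NODE_1_length_5231_cov_12.7 flag=1 multi=3.0",
    "ATGCGT",
    ">NODE_2_length_4100_cov_8.1|weird;chars",
    "GGCATT",
]
simplified, name_map = simplify_fasta_names(raw)
```

Keeping the map around is handy if you later need to trace a contig back to the name your assembler gave it.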

Now your contig names are in a form that keeps anvi’o happy too. You can then go ahead and map all your samples against them.

But this tutorial continues with the Prokka annotation step.

Run Prokka

It is likely that this script can be used to import GFF3 files into anvi’o, but I have only tested it with Prokka outputs. Please let me or the anvi’o developers know if you can confirm it works for GFF3 outputs of different software.

Run Prokka on your contigs:

 $ prokka --prefix PROKKA \
          --outdir PROKKA \
          --cpus 2 \
          --metagenome contigs.fa

Now you will have a lot of outputs in the folder PROKKA. For gff_parser.py, you only need the PROKKA.gff file. With large metagenomes, the tbl2asn tool in Prokka might take forever and even crash. Even in cases where I killed the program when it got to that point, the GFF3 files were still fine.
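
If you did have to kill Prokka, a quick look at what the GFF3 file contains can be reassuring. The sketch below counts features by type using only the Python standard library. It assumes the usual nine tab-separated GFF3 columns and stops at a `##FASTA` line if one is present; the function name and the example lines are made up for illustration.

```python
from collections import Counter


def count_gff_features(gff_lines):
    """Count GFF3 features by their type column (third column)."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # the remainder is sequence data, not features
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature line
        counts[fields[2]] += 1
    return counts


# Made-up lines in GFF3 layout, for illustration only.
example = [
    "##gff-version 3",
    "c_000000000001\tProdigal:2.6\tCDS\t10\t99\t.\t+\t0\tID=GENE_1",
    "c_000000000001\tAragorn:1.2\ttRNA\t150\t225\t.\t-\t.\tID=GENE_2",
    "c_000000000002\tProdigal:2.6\tCDS\t1\t300\t.\t-\t0\tID=GENE_3",
    "##FASTA",
    ">c_000000000001",
    "ATGCGT",
]
counts = count_gff_features(example)
```

To run it on the real output, pass an open file handle instead of the list, for example `count_gff_features(open("PROKKA/PROKKA.gff"))`.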

The next step is parsing.

Parsing the GFF3 file

From within your anvi’o virtual environment, you can run the following command line to generate the two output files, gene_calls.txt and gene_annot.txt:

 $ python gff_parser.py PROKKA/PROKKA.gff \
                         --gene-calls gene_calls.txt \
                         --annotation gene_annot.txt

One of the files will have the gene calls and the other the annotations for each gene call. You probably can guess which is which. These are also the default file names, so you don’t actually need to specify them. Or better, give them names you like more.

By default, Prokka also annotates tRNAs, rRNAs and CRISPR regions. However, this script will only utilize open reading frames reported by Prodigal in the Prokka output. If you want the resulting output files to include everything, you can use the flag --process-all.
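
The filtering described above can be pictured with a small Python sketch. This is not the code in gff_parser.py (which uses gffutils); it is a simplified stand-in that only uses the standard library. The rule "type is CDS and the source column starts with Prodigal", the output field names, and the conversion from GFF3's 1-based inclusive coordinates to a 0-based start with an exclusive stop are my assumptions about what anvi’o expects, so please check them against the anvi’o documentation before relying on them.

```python
def gff_to_gene_calls(gff_lines, process_all=False):
    """Turn GFF3 feature lines into gene-call rows (a sketch, not gff_parser.py)."""
    gene_calls = []
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # sequences follow; no more features
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue
        contig, source, feature, start, end, _score, strand, _phase, _attrs = fields
        # Assumed rule: keep only ORFs reported by Prodigal unless asked otherwise.
        is_prodigal_orf = feature == "CDS" and source.startswith("Prodigal")
        if not (process_all or is_prodigal_orf):
            continue
        gene_calls.append({
            "gene_callers_id": len(gene_calls) + 1,  # illustrative numbering
            "contig": contig,
            "start": int(start) - 1,  # GFF3 is 1-based; assumed 0-based here
            "stop": int(end),         # assumed exclusive stop
            "direction": "f" if strand == "+" else "r",
        })
    return gene_calls


# Made-up lines in GFF3 layout, for illustration only.
example = [
    "##gff-version 3",
    "c_000000000001\tProdigal:2.6\tCDS\t10\t99\t.\t+\t0\tID=GENE_1",
    "c_000000000001\tAragorn:1.2\ttRNA\t150\t225\t.\t-\t.\tID=GENE_2",
]
orfs_only = gff_to_gene_calls(example)
everything = gff_to_gene_calls(example, process_all=True)
```

With the default setting only the Prodigal CDS survives, while `process_all=True` keeps the tRNA as well, which mirrors the behaviour of the --process-all flag described above.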

Generate a contigs database and import functions

Generate a new anvi’o contigs database from your assembled contigs, and use --external-gene-calls to import the gene calls from Prokka to your database:

 $ anvi-gen-contigs-database -f contigs.fa \
                             -o contigs.db \
                             --external-gene-calls gene_calls.txt

Then import the functional annotations:

 $ anvi-import-functions -c contigs.db \
                         -i gene_annot.txt

That should be it.
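
If the import complains, one thing worth checking is whether every annotated gene also has a gene call. The sketch below compares the two tab-separated files on a shared ID column. I am assuming that both files have a header row with a column named `gene_callers_id`; the function name and the example rows are made up, so adjust them to what your files actually contain.

```python
import csv
import io


def missing_gene_ids(gene_calls_text, annotations_text, id_column="gene_callers_id"):
    """Return annotation IDs that have no matching row in the gene calls."""
    calls = csv.DictReader(io.StringIO(gene_calls_text), delimiter="\t")
    known_ids = {row[id_column] for row in calls}
    annotations = csv.DictReader(io.StringIO(annotations_text), delimiter="\t")
    annotated_ids = {row[id_column] for row in annotations}
    return sorted(annotated_ids - known_ids, key=int)


# Made-up file contents, for illustration only.
gene_calls_text = (
    "gene_callers_id\tcontig\tstart\tstop\tdirection\n"
    "1\tc_000000000001\t9\t99\tf\n"
    "2\tc_000000000002\t0\t300\tr\n"
)
annotations_text = (
    "gene_callers_id\tsource\tfunction\n"
    "1\tProkka\thypothetical protein\n"
    "3\tProkka\tDNA gyrase subunit A\n"
)
orphans = missing_gene_ids(gene_calls_text, annotations_text)
```

For real files, read them first, for example `missing_gene_ids(open("gene_calls.txt").read(), open("gene_annot.txt").read())`. An empty list means every annotation points at a known gene call.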

After this point, you can follow the rest of the metagenomic workflow using this contigs database.

Note for the pangenomics workflow

Since Prokka annotates more than just protein coding genes, these extra features might cause problems in the pangenomics pipeline, where everything is translated. That’s why gff_parser.py, while it processes the output files generated by Prokka, only utilizes the open reading frames identified by Prodigal. Alternatively, you can use the flag --process-all to have the entire output processed, and generate your contigs database with the flag --ignore-internal-stop-codons. This will work, but there may be other issues downstream.

Modifying Prokka to include partial gene calls

You might have noticed that Prokka finds far fewer genes in your metagenome than the standard anvi’o pipeline. This is because of the hard-coded Prodigal option -c in Prokka, which only calls full-length genes (it probably is there for a reason, so proceed with caution).

But for metagenomes, you might expect to have many partial gene calls, and might want to include them in your analyses. To achieve that, you can hack the Prokka script and remove the -c option from the Prodigal command. On Prokka 1.12 it’s on line 705. It might be wise to save the modified version of Prokka to a new file and use that when annotating stuff for anvi’o. It would also be important to mention this in the methods section in case you proceed to publish your analyses.
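
If you would rather script this edit than do it by hand, something along these lines could work. It is only a sketch: the function name is hypothetical, and the example line is made up rather than copied from the Prokka source. The function drops the first standalone `-c` from any line that mentions prodigal and leaves every other line alone, so do compare the patched copy with the original before using it.

```python
def drop_closed_ends_flag(script_lines):
    """Remove a standalone ' -c ' from lines that mention prodigal."""
    patched = []
    for line in script_lines:
        if "prodigal" in line and " -c " in line:
            # Replace only the first occurrence to keep the change minimal.
            line = line.replace(" -c ", " ", 1)
        patched.append(line)
    return patched


# Made-up script lines, NOT the actual Prokka source.
example_script = [
    'my $cmd = "prodigal -i contigs.fna -c -m -f gff";',
    'print "done -c ";',
]
patched = drop_closed_ends_flag(example_script)
```

To apply it to a file, read the lines of your Prokka copy, pass them through the function, and write the result to a new file name so the original script stays untouched.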