Elaina Graham

A note from the Meren Lab: We are very thankful to Elaina for sharing her expertise on behalf of all anvi’o users who wished to import KEGG annotations into their workflows. Elaina is currently a third-year PhD student at the University of Southern California, where she uses her background in microbiology and molecular biology, and her learnings from genome-resolved metagenomics, to generate testable hypotheses regarding the diversity and functioning of microbial populations. She is the developer of BinSanity, which is an automated binning algorithm that was recently used in another study she co-authored to recover more than two thousand metagenome-assembled genomes from the TARA Oceans Project.

Genome annotation is a key step in analyzing bioinformatic data, but with so many databases available it can be difficult to decide where to start. One useful database is the Kyoto Encyclopedia of Genes and Genomes (KEGG), which integrates functional information, biological pathways, and sequence similarity. GhostKOALA is an automatic annotation tool that assigns KEGG identifiers to metagenomes. GhostKOALA is a web server though, so is it really worth annotating your contigs with this service rather than with something you can run locally, such as InterProScan or a search against the NCBI COGs database? Well, one reason to use KEGG is that the database contains a variety of well-described and environmentally important metabolic pathways, with associated pathway maps. Another is that GhostKOALA can handle metagenomes containing eukaryotes. KEGG is also widely used as a reference, so KEGG identifiers are easy to cross-reference with the published literature. In my own experience with this workflow, making initial observations about the functional roles of the MAGs I generated went much faster because I could quickly take KEGG identifiers and import them into the BRITE mapping function of KEGG, or use something like the KEGG Decoder to visualize functional potential.

This tutorial walks you through annotating your contigs with GhostKOALA, and compiling the results into a format easily importable to your anvi’o contigs database using KEGG-to-anvio. You can download all of the associated scripts for this workflow here.

All you’ll need to run this tutorial is an installation of anvi’o, Python, and internet access. In Python you’ll need the pandas and BioPython modules. I was running pandas v0.22.0 and BioPython v1.70, both of which are dependencies of anvi’o (hence you should have them if you have anvi’o).

So first let’s download all of the scripts and files we’ll need for this tutorial. They are all on GitHub, so we can clone the repository as shown below:

 $ git clone https://github.com/edgraham/GhostKoalaParser.git

This will produce a directory containing all of the files and scripts we’ll use throughout this tutorial, which assumes that you already have an anvi’o contigs database (as explained in the metagenomic workflow tutorial).

Now that you have your contigs database, you may want to annotate your gene calls. There are many ways to do this, but I am going to show you how to pull KEGG annotations into your contigs database using GhostKOALA, which is KEGG’s metagenome annotation service.

GhostKOALA is one of the few ways I have found to access KEGG annotations for large datasets without our lab subscribing to the downloadable KEGG database. But before I start, I recommend that everyone give the GhostKOALA paper a quick glance. One thing you’ll notice is that GhostKOALA does not output bit scores or e-values; instead, it has a built-in way of determining whether a particular annotation is accurate or not.

Export anvi’o gene calls

The first step to this is exporting amino acid sequences from your anvi’o contigs database.

 $ anvi-get-sequences-for-gene-calls -c CONTIGS.db \
                                     --get-aa-sequences \
                                     -o protein-sequences.fa

Exporting the sequences this way makes sure anvi’o will be able to associate the annotations back to the genes described in the contigs database.

Run GhostKOALA

GhostKOALA does have an upload limit of 300MB. If your anvi’o database is rather large and you find that your protein-sequences.fa file is larger than that, you can split the file into multiple FASTA files, and concatenate the GhostKOALA outputs you will be generating during this step.
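If you do need to split the file, something along these lines may help. It is only a minimal sketch: the function names, the output naming scheme, and the 250 MB default (chosen here simply to leave some margin under the 300MB limit) are all assumptions made for illustration and are not part of anvi’o or the GhostKoalaParser repository.

```python
def read_records(lines):
    """Yield each FASTA record (defline plus its sequence lines) as one string."""
    record = []
    for line in lines:
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        record.append(line)
    if record:
        yield "".join(record)


def chunk_records(lines, max_bytes):
    """Yield lists of whole records whose combined size stays within max_bytes.

    A single record larger than max_bytes still gets a chunk of its own.
    """
    chunk, size = [], 0
    for record in read_records(lines):
        record_size = len(record.encode("utf-8"))
        # Start a new chunk rather than split a record across two files.
        if chunk and size + record_size > max_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(record)
        size += record_size
    if chunk:
        yield chunk


def split_fasta(fasta_path, out_prefix, max_bytes=250 * 1024 * 1024):
    """Write each chunk to <out_prefix>.<n>.fa and return the paths written."""
    paths = []
    with open(fasta_path) as handle:
        for number, chunk in enumerate(chunk_records(handle, max_bytes), start=1):
            path = f"{out_prefix}.{number}.fa"
            with open(path, "w") as out:
                out.writelines(chunk)
            paths.append(path)
    return paths
```

Calling split_fasta("protein-sequences.fa", "protein-sequences.part") would write protein-sequences.part.1.fa, protein-sequences.part.2.fa, and so on. Each part can then be submitted to GhostKOALA separately.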

Before we run GhostKOALA we need to make a small modification to our protein-sequences.fa file. The parser that GhostKOALA uses for FASTA sequences appears to have an unfriendly relationship with sequence IDs that begin with a number, and it will send you angry error messages if you submit the file exactly as anvi’o exported it, since its deflines are bare gene caller IDs (such as >0 or >2).

To remedy this, run this command to add a genecall_ prefix to every defline in your FASTA file (on macOS, the BSD version of sed needs -i '' in place of -i):

 $ sed -i 's/>/>genecall_/' protein-sequences.fa
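If you would rather do this step in Python, for instance to avoid the differences between GNU and BSD sed, a rough equivalent could look like the sketch below. The function name is hypothetical. Unlike the sed command, this version leaves deflines that already carry the prefix untouched, so running it twice should not produce genecall_genecall_ IDs.

```python
def prefix_deflines(lines, prefix="genecall_"):
    """Yield FASTA lines with `prefix` inserted right after '>' on each defline."""
    for line in lines:
        if line.startswith(">") and not line.startswith(">" + prefix):
            yield ">" + prefix + line[1:]
        else:
            # Sequence lines and already-prefixed deflines pass through unchanged.
            yield line
```

To apply it to a file, read the lines of the exported FASTA and write the output of prefix_deflines to a new file, then upload that new file instead.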

Your FASTA file is now ready to be submitted to GhostKOALA for annotation.

To do this go to the GhostKOALA webserver, and click the “Choose File” button underneath the section that says Upload query amino acid sequences in FASTA format. From the menu you will upload your protein-sequences.fa, and will be asked to provide an email address.

You can only run one instance of GhostKOALA at a time per email address.

Before the run starts you’ll get an email that will look like this:

Your GhostKOALA job request
Query dataset: 1 entries
KEGG database to be searched: c_family_euk+genus_prok

Please click on the link below to either submit or cancel your job.

   (you will see your links here)

If no action is taken within 24 hours, your request will be deleted.

Click the link that says ‘Submit’, and your run will begin processing.

The KEGG parser script also has the option to combine your GhostKOALA results with interproscan. If you want to incorporate both annotations from interproscan and KEGG in the same table, follow the tutorial here to run interproscan, but run interproscan with the flags -f tsv, --goterms, --iprlookup, and --pathways.

Generate the KEGG orthology table

This step is optional and is here for anyone who is curious about how I generated the KEGG Orthology file :)

In the repository you cloned earlier there is a file called KO_Orthology_ko00001.txt.

If you take a peek at this file it looks like this:

Metabolism Overview 01200 Carbon metabolism [PATH:ko01200] K00844 HK; hexokinase [EC:]
Metabolism Overview 01200 Carbon metabolism [PATH:ko01200] K12407 GCK; glucokinase [EC:]
Metabolism Overview 01200 Carbon metabolism [PATH:ko01200] K00845 glk; glucokinase [EC:]
Metabolism Overview 01200 Carbon metabolism [PATH:ko01200] K00886 ppgK; polyphosphate glucokinase [EC:]
Metabolism Overview 01200 Carbon metabolism [PATH:ko01200] K08074 ADPGK; ADP-dependent glucokinase [EC:]
Metabolism Overview 01200 Carbon metabolism [PATH:ko01200] K00918 pfkC; ADP-dependent phosphofructokinase/glucokinase [EC:]
Metabolism Overview 01200 Carbon metabolism [PATH:ko01200] K01810 GPI, pgi; glucose-6-phosphate isomerase [EC:]
Metabolism Overview 01200 Carbon metabolism [PATH:ko01200] K06859 pgi1; glucose-6-phosphate isomerase, archaeal [EC:]
(…) (…) (…)  

I will now show you how to generate this file, which contains the necessary information to convert the KEGG Orthology assignments to functions.

Because the KEGG database is currently working under a subscription model, I had to find a workaround to access the information to match the orthologies with function. To do this you can run the following command (or click the Download htext link in this page) to download the htext file:

wget 'https://www.genome.jp/kegg-bin/download_htext?htext=ko00001&format=htext&filedir=' -O ko00001.keg

Once it is finished, you can use this (not-so-beautiful) code snippet to parse that file into the table above. It should work if you simply copy-paste it into your terminal and have the file ko00001.keg in your working directory:


kegfile="ko00001.keg"

while read -r prefix content
do
    case "$prefix" in
        A) col1="$content" ;;
        B) col2="$content" ;;
        C) col3="$content" ;;
        D) echo -e "$col1\t$col2\t$col3\t$content" ;;
    esac
done < <(sed '/^[#!+]/d;s/<[^>]*>//g;s/^./& /' < "$kegfile") > KO_Orthology_ko00001.txt

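This simply goes through the hierarchical ‘.keg’ file you downloaded and flattens its layers: it remembers the most recent A, B, and C levels and prints one row for every D (gene) line. The resulting output should be a tab-delimited file in which the first column corresponds to the broadest classification, and the fourth and last column corresponds to the gene itself (its KO identifier and description).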
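For readers more comfortable with Python than with bash, the same flattening can be sketched as follows. This is an illustrative re-implementation, not the code used to build the file in the repository, and the function name is made up for this example.

```python
import re


def parse_keg(lines):
    """Yield (A, B, C, D) tuples, one per gene (D) line of a KEGG htext file."""
    levels = {}
    for line in lines:
        line = line.rstrip("\n")
        # Keep only hierarchy lines; this drops the '#', '!' and '+' lines.
        if not line or line[0] not in "ABCD":
            continue
        # Remove HTML tags such as <b>...</b> and surrounding whitespace.
        content = re.sub(r"<[^>]*>", "", line[1:]).strip()
        if not content:
            continue  # bare level markers act only as separators
        levels[line[0]] = content
        if line[0] == "D":
            yield (levels.get("A", ""), levels.get("B", ""),
                   levels.get("C", ""), content)
```

Joining each tuple with tabs and writing it out line by line should give a table with the same four columns as the bash version.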

You may notice that some of the KEGG identifiers appear multiple times in this parsed file. This is because many of the metabolism genes are constituents of multiple pathways (a piece of this puzzle that I have yet to find an efficient way to incorporate into the anvi’o annotations).
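To get a feel for how common this is in your copy of the file, you could group the pathways by KO identifier. The sketch below assumes the four-column, tab-delimited layout described above, with the KO identifier as the first word of the last column. The function name is hypothetical.

```python
from collections import defaultdict


def pathways_per_ko(lines):
    """Map each KO identifier to the set of third-column pathways listing it."""
    pathways = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or not fields[3].strip():
            continue  # skip malformed or empty rows
        accession = fields[3].split()[0]
        pathways[accession].add(fields[2])
    return pathways
```

Passing it the open KO_Orthology_ko00001.txt file and keeping only the entries whose set has more than one member would list the identifiers that occur in several pathways.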

Now that you know how I generated KO_Orthology_ko00001.txt, let’s get back to the fun part and import our functions!

Parsing the results from GhostKOALA

Once GhostKOALA has finished running it will send you an email with a link to the results.

Download the annotation file, which will be called user_ko.txt, and make sure the results look like this:

 $ head user_ko.txt
 genecall_0       K01923
 genecall_2       K03611
 genecall_4       K01952
 genecall_5       K01693
 genecall_6       K00817
 genecall_7       K00013
 genecall_8       K00765
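If you want to check the file programmatically rather than by eye, a small reader like the one below could help. It is a sketch under two assumptions that are worth verifying against your own download: that every gene ID carries the genecall_ prefix added earlier, and that genes without an assignment, if present, appear as lines holding only the gene ID. The function name is made up for this example.

```python
def read_ghostkoala(lines, prefix="genecall_"):
    """Return ({gene_callers_id: KO}, [ids without a KO]) from user_ko.txt lines."""
    annotated, unannotated = {}, []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if not fields[0].startswith(prefix):
            raise ValueError(f"line {number}: unexpected gene ID {fields[0]!r}")
        # Strip the prefix to recover the integer gene caller ID used by anvi'o.
        gene_id = int(fields[0][len(prefix):])
        if len(fields) > 1:
            annotated[gene_id] = fields[1]
        else:
            unannotated.append(gene_id)
    return annotated, unannotated
```

If you split your FASTA file into several parts, feeding the lines of all the resulting user_ko.txt files through this function one after another amounts to concatenating them.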

Now your file is ready to be imported!

For this, we will use the program KEGG-to-anvio (which is in the directory you cloned from GitHub earlier).

The help menu is shown below:

 $ python KEGG-to-anvio -h
 usage: KEGG-to-anvio [-h] [--KeggDB KEGGDB] [-i I]
                        [--interproscan INTERPROSCAN] [-o O]

Combines annotation Data for input to anvio

optional arguments:
  -h, --help            show this help message and exit
  --KeggDB KEGGDB       identify the Kegg Orthology file (modified from htext
                        using given bash script)
  -i I                  specify the file containing GhostKoala Results
  --interproscan INTERPROSCAN
                        interproscan results
  -o O                  Specify an output file

Now the next step is pretty simple. If you don’t have interproscan results, you can run it as:

$ python KEGG-to-anvio --KeggDB KO_Orthology_ko00001.txt \
                          -i user_ko.txt \
                          -o KeggAnnotations-AnviImportable.txt

If you have interproscan results, let’s say in an output file called interproscan-results.txt (assuming interproscan was run with the flags -f tsv, --goterms, --iprlookup, and --pathways), then you can run this command as:

$ python KEGG-to-anvio --KeggDB KO_Orthology_ko00001.txt \
                          -i user_ko.txt \
                          -o KeggAnnotations-AnviImportable.txt \
                          --interproscan interproscan-results.txt

Now you should have a file called KeggAnnotations-AnviImportable.txt in your work directory, which can be imported into anvi’o using the program anvi-import-functions!!

Here is how you can do it:

$ anvi-import-functions -c CONTIGS.db \
                        -i KeggAnnotations-AnviImportable.txt
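For a sense of what such an importable file contains: anvi-import-functions reads a tab-delimited table with the columns gene_callers_id, source, accession, function, and e_value. The sketch below builds rows in that layout from two dictionaries. It is not the KEGG-to-anvio script, and several things in it are assumptions made for illustration: the function name, the source label, the choice to write 0 as the e-value (since GhostKOALA reports none), and the choice to skip KOs that have no description in the lookup.

```python
def functions_table(ko_by_gene, function_by_ko, source="KEGG_GhostKOALA"):
    """Yield tab-delimited rows in the column layout anvi-import-functions reads.

    ko_by_gene:     {integer gene caller ID: KO accession}
    function_by_ko: {KO accession: description}
    """
    header = ["gene_callers_id", "source", "accession", "function", "e_value"]
    yield "\t".join(header)
    for gene_id in sorted(ko_by_gene):
        accession = ko_by_gene[gene_id]
        function = function_by_ko.get(accession)
        if function is None:
            continue  # assumption: leave out KOs that lack a description
        # GhostKOALA gives no e-values, so a placeholder of 0 is written here.
        yield "\t".join([str(gene_id), source, accession, function, "0"])
```

Note that the first column holds the plain integer gene caller IDs, so the genecall_ prefix added for GhostKOALA has to be stripped again before import.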

Two birds one stone

If you decide you want to kill two birds with one stone, you can also take the taxonomy data produced by GhostKOALA and convert it into an anvi’o-importable format using the script GhostKOALA-taxonomy-to-anvio, found in the GitHub repository you cloned.

The taxonomy file will download as a file called user.out.top. You can then run the parser like this:

python GhostKOALA-taxonomy-to-anvio user.out.top KeggTaxonomy.txt

Which then can be imported into your anvi’o contigs database this way:

anvi-import-taxonomy-for-genes -c CONTIGS.db \
                               -i KeggTaxonomy.txt \
                               -p default_matrix

You now have KEGG functions in your anvi’o contigs database!

Hopefully (in the near future) we can find a better way to do this without having to go through all the convoluted steps with GhostKOALA.
