Microbial 'omics


Brought to you by

Vibrio jascida pangenome: a mini workshop

A. Murat Eren (Meren)

Table of Contents

The goal of this mini workshop was to compute a Vibrio jascida pangenome using the isolate genomes that were generated during the Microbial Diversity course at the Marine Biological Laboratory in 2018. Essentially, we sat together with the course participants of the Microbial Diversity Course in July 2021 in Woods Hole, and went through the information on this page while discussing what we are seeing and did spontaneous investigations of the pangenome we have computed for Vibrio jascida. We are making this content publicly available in case others wish to play with these data, primarily because why not.

Thanks to the generosity of the Microbial Diversity Course at the MBL, all the novel isolate genomes this mini workshop will be using are now publicly available, and anyone will be able to follow this material on your own anvi’o-installed computers. But if you happen to do something that is much more serious than what is covered here, please consider reaching out to Rachel Whitaker before any publication to discuss the best way forward.

The following material starts with a very brief introduction to the isolates, and work its way through the steps of computing a pangenome using anvi’o. I hope this material will be useful not only to those of you who sat through the workshop at the MBL with me in-person, but also to those of you who run into this material later.

Please feel free to leave a comment at the end of the page or find us on Slack.

Questions? Concerns? Find us on

This tutorial is tailored for anvi’o v7.1 or later. You can learn the version of your installation by typing anvi-interactive -v. If you have an older version, some things may not work the way they should since we have a few new programs used in this tutorial.

Isolate genomes

Here we will analyze genomes of seven Vibrio jascida populations isolated from various Woods Hole places (including Eel Pond, Marine Resources Center, and the Great Harbor) using complete seawater medium from. This table offers a brief information for each isolate to get back to later:

Plate Num Researcher Source
12 Peggy Lai Seawater, Great Harbor, Woods Hole
13 Peggy Lai Seawater, Great Harbor, Woods Hole
14 Peggy Lai Seawater, Great Harbor, Woods Hole
47 Danielle Campbell Seawater, Eel Pond, Woods Hole
52 Brittni Bertolet ??
53 Sarah Schwenck Coral (from MRC), Woods Hole
55 Monica E. McCallum Coral (from MRC), Woods Hole

Finalizing the list of genomes

Before we start, we can already think about how to make our analysis a bit more inclusive. Since we are going to be analyzing genomes of a bunch of novel Vibrio jascida populations, we can include into our collection at least one reference genome to have an anchor in everyone else’s reality. Indeed, searching on the NCBI for Vibrio jascida does return some complete genomes:

Let’s download the first one:

Unzipping the compressed file NCBI sends resulted in a directory with a lot of things in it, but we simply want the FASTA file, and this is how I moved that FASTA file into my work directory where all the FASTA files for other Vibrio jascida isolates were living:

mv ~/Downloads/V_jasicida_NCBI/ncbi_dataset/data/GCF_002887615.1/GCF_002887615.1_ASM288761v1_genomic.fna ref_scaffolds.fasta

After this, I have the following FASTA files in my work directory:

ls *fasta

12_scaffolds.fasta
13_scaffolds.fasta
14_scaffolds.fasta
47_scaffolds.fasta
52_scaffolds.fasta
53_scaffolds.fasta
55_scaffolds.fasta
ref_scaffolds.fasta

OK. We are good to go.

Downloading the genomes to follow this workshop

You can download the genomes to follow this workshop:

# download the pack
curl -L https://ndownloader.figshare.com/files/28965090 -o V_jascida_genomes.tar.gz

# unpack it
tar -zxvf V_jascida_genomes.tar.gz

# go into the new directory
cd V_jascida_genomes

# take a look at the list of files to make sure they're identical
# to what I showed above
ls

Please remember these novel genomes are public thanks to the generosity of the Microbial Diversity Course at the MBL and its 2018 participants, Peggy Lai, Danielle Campbell, Brittni Bertolet, Sarah Schwenck, and Monica E. McCallum.

Moving on.

FASTA files to Contigs DBs

For a pangenomic analysis of these genomes we will use anvi’o, an open-source software platform for microbial ‘omics. And anvi’o requires some additional steps for us to be able to work with FASTA files :/

Most analyses in anvi’o start after the conversion of FASTA files into a special anvi’o database called contigs database. So the first order of business is to turn each of these FASTA files into a contigs database.

We can turn a FASTA file into an anvi’o contigs database by running the program anvi-gen-contigs-database. So that is what we need to do that for each one of the FASTA files we have one by one. But a better way to do such repetitive tasks is to put together a for loop using the scripting capabilities of the shell (which we access through our terminal).

To build a for loop, let’s first put our genome names into a text file:

ls *fasta | awk 'BEGIN{FS="_"}{print $1}' > genomes.txt

If you take a look at the contents of this file, you’ll see that it is nothing more than the following:

cat genomes.txt

12
13
14
47
52
53
55
ref

You will see how it will become useful in a second.

Finalizing the FASTA files

Before we go through converting each FASTA file into a contigs database, it is always a great idea to take a look at them to make sure they all make sense. What I mean by looking at them is literally looking at them in a text editor.

For instance, when you do that it becomes immediately clear that the FASTA files assembled from the sequencing of isolates contain many extremely short sequences. Since we are working with cultured organisms and expect high-quality assemblies, it is a good idea is to start by removing sequences that are too short since they may be coming from low-abundance contaminants that didn’t assemble well and/or influence how gene clusters are formed.

There are multiple ways to remove short sequences. But here we will go through each entry in our genomes.txt file, and use the program anvi-script-reformat-fasta create copies of each FASTA file without the sequences that are less than 2,500 nucleotides:

for g in `cat genomes.txt`
do
    echo
    echo "Working on $g ..."
    echo
    anvi-script-reformat-fasta ${g}_scaffolds.fasta \
                               --min-len 2500 \
                               --simplify-names \
                               -o ${g}_scaffolds_2.5K.fasta
done

Generating contigs databases

Now we have our final, anvi’o-compatible FASTA files, it is time to generate anvi’o contigs databases using the program anvi-gen-contigs-database:

for g in `cat genomes.txt`
do
    echo
    echo "Working on $g ..."
    echo
    anvi-gen-contigs-database -f ${g}_scaffolds_2.5K.fasta \
                              -o V_jascida_${g}.db \
                              --num-threads 4 \
                              -n V_jascida_${g}
done

Annotating contigs databases

An anvi’o contigs-db is a talented form of a FASTA file, and in addition to our genome sequences, it can hold a lot of additional information per genome. Such as gene calls, sequence k-mer frequencies, gene functions, and so on. Here we can use some relevant anvi’o programs, such as anvi-run-hmms, anvi-run-ncbi-cogs, anvi-scan-trnas, anvi-run-scg-taxonomy, to identify bacterial single-copy core genes, ribosomal RNAs, transfer RNAs among our contigs, and annotate our genes with functions (note that we are now going through each file that ends with ‘.db’, rather than entries in the genomes.txt):

for g in *.db
do
    anvi-run-hmms -c $g --num-threads 4
    anvi-run-ncbi-cogs -c $g --num-threads 4
    anvi-scan-trnas -c $g --num-threads 4
    anvi-run-scg-taxonomy -c $g --num-threads 4
done

Taking a quick look at genome stats

Since we have our contigs databases ready, let’s take a look at their simple features using the program anvi-display-contigs-stats:

anvi-display-contigs-stats *db

Running this pulls this display up on my browser:

There is a lot to discuss here, so let’s take a moment and try to make sense of this information here.

Creating an external genomes file

This is not necessarily a standard step of turning FASTA files to contigs databases, but when we need to work with a bunch of contigs databases in anvi’o, we use a special file called external-genomes to describe them all together in one place as a logically relevant group of contigs databases (in case you are curious, yes there also is an internal-genomes file format). So let’s create one here so we can use it for our downstream analyses.

The structure of the external-genomes file is simple: it is a two-column TAB-delimited file. One can easily create it using EXCEL or any kind of text editor. But anvi’o comes with a script, anvi-script-gen-genomes-file to create one from all the contigs databases found in a directory. So let’s use that to generate an external genomes file:

anvi-script-gen-genomes-file --input-dir . \
                             -o external-genomes.txt

You can take a look at the resulting file to see how simple it is.

Investigating contamination

In theory, everything is ready at this point for us to compute a pangenome. But when one is dealing with sequence data, nothing can ever be truly ready for anything in practice. I learned over the years that every time I look at things more carefully I find more things to improve. There is no end to it and we often have to call it a day and move on. But this is the last checkpoint before we turn our genomes into a pangenome, so it is a good idea to invest a little more time to see if there is something that would REALLY upset us later.

We can start by taking a quick look at the level of completion of our genomes using the program anvi-estimate-genome-completeness, which, like many anvi’o programs, accepts external-genomes as an input:

anvi-estimate-genome-completeness -e external-genomes.txt

And voila:

This output clearly shows that there is something fishy going on with V_jascida_52. It has more redundancy than others, it is larger in size than others, etc. This can be due to some sort of contamination, but how can we be sure? One way to do it is to take a more careful look at this genome by visualizing it through anvi-interactive, and perhaps remove some contigs from it if necessary.

Visualizing contigs for refinement

If you were to look at the help menu of anvi-interactive you would realize that this program requires an anvi’o profile database. Which we do not have, since we did not profile anything.

Profiling in anvi’o typically makes sense of read recruitment results, and we have not conducted any read recruitment here since we did not have access to the raw sequencing data of these isolates.

But while the program anvi-refine requires a BAM file as an input, we still can generate a blank profile without any mapping results using the flag --blank:

anvi-profile -c V_jascida_52.db \
             --sample-name V_jascida_52 \
             --output-dir V_jascida_52 \
             --blank

OK. Now we are ready to use anvi-interactive:

anvi-interactive -c V_jascida_52.db \
                 -p V_jascida_52/PROFILE.db

Which shows us this display:

The organization of contigs based on tetranucleotide frequency shows clear signs of contamination, but we are able to identify a set of contigs that we are relatively more confident to represent the true V. jascida genome we are interested in. In my opinion this is how it looks like:

We can even talk about what is contaminating:

So it is clear that we don’t want to include those contaminating genes in our analysis and we should only work the But how to get those sequences we marked as V_jascida_52_CLEAN. But how do we get those sequences?

Splitting a genome

Getting out the clean sequences from this will be much easier than you probably think, but it requires some introduction to some of the anvi’o lingo for which we don’t have time or space. That said, here is a very brief blurb: we have stored our bins into a collection called default using the store bin collection button in the previous screenshot (the collection name doesn’t really matter, we could call it anything). Having a collection in an anvi’o profile database (which is stored in our blank profile database in this case), enables us to access sequences in those bins in different ways. If you would like to have an idea, you can take a look at all the anvi’o programs that can work with a collection artifact.

One of those programs that can use a collection is the program anvi-split, which does something we really need: splits a contigs database into smaller ones each of which individually represent a single bin in a collection. So we can run it on our poor Vibrio jascida #52:

anvi-split -p V_jascida_52/PROFILE.db \
           -c V_jascida_52.db \
           -C default \
           -o V_jascida_52_SPLIT

The contigs-db for the ‘cleaned up’ V. jascida #52 is now here in my work directory:

V_jascida_52_SPLIT/V_jascida_52_CLEAN/CONTIGS.db

Updating the external genomes file

Now we have a better contigs database for V. jascida #52, we want to have a new external-genomes file where the old V. jascida #52 is replaced with the cleaner one. One quick way to do that is the following:

sed 's/V_jascida_52.db/V_jascida_52_SPLIT\/V_jascida_52_CLEAN\/CONTIGS.db/g' external-genomes.txt > external-genomes-final.txt

Which creates a new file called external-genomes-final.txt that we can use for downstream analyses. Before we move on, let’s take a look at the difference between the two sets of contigs databases using anvi-estimate-genome-completeness just to be sure:

anvi-estimate-genome-completeness -e external-genomes.txt
anvi-estimate-genome-completeness -e external-genomes-final.txt

Computing the pangenome

The first step of computing a pangenome is to generate a yet-another anvi’o artifact: genomes storage, which essentially merges all the contigs databases into a single, leaner file, so pangenomes can be made publicly available. The program anvi-gen-genomes-storage does that for us by taking in an external-genomes file:

anvi-gen-genomes-storage -e external-genomes-final.txt \
                         -o V_jascida-GENOMES.db

This step took 6 seconds on my laptop.

Then, we can use the program anvi-pan-genome to compute the pangenome:

anvi-pan-genome -g V_jascida-GENOMES.db \
                --project-name V_jascida \
                --num-threads 4

This step took 4 minutes on my laptop.

Displaying the pangenome

We can now visualize this pangenome the program anvi-display-genome and start discussing what we see.

anvi-display-pan -p V_jascida/V_jascida-PAN.db \
                 -g V_jascida-GENOMES.db

The command above gives us this ugly (yet interactive) display:

The unpolished pangenome

There are many things that can be done here, and covering all the features of the interface would be almost impossoble. But let’s do a few things to improve our understanding of the members of this pangenome. First, let’s change the background color of the reference genome so we see it clearly.

For that we first mark it in the ‘main panel’ so it is easy to see which genome it is:

Selecting the reference genome

And then change the background color before re-drawing it:

Changing the background color of a layer

When we redraw, this is what we see:

As you can see the way genomes are ordered in this display follows the alphabetical order. Gene clusters, however, don’t seem to care much about the way we named the genomes. The next thing we can do is to switch to a more meaningful order using the Layers panel:

Drawing again, gives us this:

Things are slowly coming together, but the display is just way too busy. We can remove all those additional layers for now and then turn them on for our discussions. First the additional data layers for items:

Then the additional data for layers:

OK. Looks much better.

Calculating average nucleotide identity between genomes

One question that display can not answer to what extent the associations between genomes are predicted by the average nucleotide identity between each genome. But we can easily recover that by using the program anvi-compute-genome-similarity to calculate ANI between genomes, and also add that information to the pangenome:

Before killing your server to run this command, you may want to store your state :)

anvi-compute-genome-similarity --external-genomes external-genomes-final.txt \
                               --program pyANI \
                               --output-dir ANI \
                               --num-threads 6 \
                               --pan-db V_jascida/V_jascida-PAN.db

The successful completion of this will update our pangenome.

Displaying ANI and adjusting it

We can re-display the same pangenome, now with ANI values in it, using anvi-display-genome:

anvi-display-pan -p V_jascida/V_jascida-PAN.db \
                 -g V_jascida-GENOMES.db

But to actually display ANI values, we need to turn on what we wish to display from the Layers tab:

Although clicking redraw may yield a rather disappointing discovery o_O

But do not despair. This is due to the fact that the default min-max values of ANI values are set to 70% to 100%, and our genomes way too closely related for those numbers to make sense. Since they resolve to the same ‘species’, 95% ANI is a much better cutoff than 70% to start with.

One can in fact quickly adjust these values:

98% it is. With that, things make a little more sense:

Studying the pangenome

Perhaps this is a good moment to stop and discuss what we are seeing in the light of how these isolates relate to one another:

Plate Num Researcher Source
12 Peggy Lai Seawater, Great Harbor, Woods Hole
13 Peggy Lai Seawater, Great Harbor, Woods Hole
14 Peggy Lai Seawater, Great Harbor, Woods Hole
47 Danielle Campbell Seawater, Eel Pond, Woods Hole
52 Brittni Bertolet ??
53 Sarah Schwenck Coral (from MRC), Woods Hole
55 Monica E. McCallum Coral (from MRC), Woods Hole

There are MANY ways to perform exploratory analyses or test hypotheses once one reaches to this stage. I would like everyone to take 30 minutes with the pangenome to find their own stories in it. But here are some quick tips that you may find beneficial.

Inspecting amino acid sequence alignments

Binning gene clusters

Changing how gene clusters are ordered

For instance, an ‘enforced synteny’ using the reference genome will change your display, where gene clusters will be ordered to respect the synteny of the genes they contain in the reference genome:

Selecting a range of gene clusters

Searching for functions

Searching using gene cluster filters

Storing a collection

Storing a collection is an important step since the concept of collection is important for anvi’o.

Splitting a pangenome

Do you remember the program anvi-split that we used to split a genome into pieces? Well, the same program can split a pangenome, too, because that’s how anvi’o rolls. If you were to be interested in focusing ONLY on the single-copy core gene clusters we stored as a collection in the previous gif, you could split it from the pan database into its standalone project.

If you don’t remember collections stored in your pangenome, you can always use the program anvi-show-collections-and-bins:

anvi-show-collections-and-bins -p V_jascida/V_jascida-PAN.db

Collection: "my_collection"
===============================================
Collection ID ................................: my_collection
Number of bins ...............................: 2
Number of splits described ...................: 5,131
Bin names ....................................: SCGs, Singletons

OK, so we want the SCGs bin from the collection my_collection. Here comes the anvi-split:

anvi-split -p V_jascida/V_jascida-PAN.db \
           -g V_jascida-GENOMES.db \
           --collection-name my_collection \
           --bin-id SCGs \
           --output-dir SPLIT

And here we go:

anvi-display-pan -p SPLIT/SCGs/PAN.db \
                 -g V_jascida-GENOMES.db

It is boring because everything is a single-copy core here. But there are indeed very interesting things to do. If we have time here during the workshop, Meren will show an extremely high-resolution phylogenomics analysis using subtly varying single-copy core genes.

Summarizing the pangenome

This effectively gives you a huge TAB-delimited file that describes everything you see on your pangenome. It is quite a critical output.

While I would strongly recommend you to use the anvi’o command line program anvi-summarize to generate a static output file for your pangenome, you can also do it through the interface:

Perhaps this is a good moment to take a look at the summary output, and open the summary of gene clusters in EXCEL or R.

Sharing the pangenome with others

The anvi-summarize step essentially generates an output directory with an index.html file that can be opened on any browser, and enables you to share your pangenome as a supplementary data item in any publication.

Another option, especially if you would like to feel like the year 2021 has finally arrived, you can shre your pangenome in a fully reproducible and interactive way. For that (1) create a new directory, (2) copy into this directory the two essential files, V_jascida/V_jascida-PAN.db and V_jascida-GENOMES.db, (3) compress the directory (i.e., tar/gz it), (4) upload it to FigShare, and (5) include the DOI for this item in your data availability section of your publication.


Thank you for your time! Please let me or others know if you have any suggestions!

Questions? Concerns? Find us on