# Studying hypervariability in genomes through metagenomes

A. Murat Eren (Meren)
Blair G. Paul

Summary

The purpose of this tutorial is to demonstrate a bioinformatics workflow that enables the use of metagenomics to recognize hypervariable regions in microbial genomes and helps quantify and characterize the within-population sequence diversity of such regions for ecological and evolutionary surveys of naturally ocurring mirobial populations.

While this tutorial uses diversity-generating retroelements (DGRs) as an example and focuses on the activity of a particular DGR in a Trichodesmium genome, the strategies covered in this document are applicable to any other context in which rampant hypervariability or subtle microdiversity is of interest.

The tutorial also serves as a reproducible bioinformatics workflow for the recent review Blair Paul and I wrote on some of the most beautiful and mysterious ways for life to beat the boring means of evolution and skip ahead:

The organization of the document will follow the organization of Figure 3 in that work, and will demonstrate the key analysis steps that led to its emergence. We hope that this workflow will inspire you to keep in mind,

• The importance of studying metagenomic coverage data more carefully,
• The utility of single-nucleotide variants (SNVs) to recognize biological phenomena,
• The power of single-amino acid variants (SAAVs) to quantify non-synonymous variation,
• The ways to access unmapped or unmappable but critically relevant metagenomic short reads that originate from hypervariable regions through in silico primers,
• The application of oligotyping to metagenomics to investigate the nexus between ecology and hypervariability,
• The integrated platform anvi’o for similar novel investigations for which there are no off-the-shelf software options or canned solutions.

## Introduction

This tutorial will walk you through a high-resolution approach to study within-population sequence diversity of environmental microbes in the context of short genomic regions using metagenomics.

Microbiology is drowning in publicly available ‘omics data from any habitat. Yet, even though the computational strategies to study such data are most powerful in the hands of life scientists, it is still a challenge to take the initial steps into the world of ‘omics. While many off-the-shelf tools for metagenomics conveniently take in raw data and produce final outuputs, the ability to control what happens in between remains to be a luxury that only computational folk can afford. But some questions of microbiology require much more hands-on access and control over complex data than others. I believe sudying hypervariability or microdiversity are two of such questions where it is very difficult to get meaningful answers from automated approaches. The good news is that making sense of hypervariability using complex data sources without any expertise in computation AND without having to use any black-box automation is not impossible. This tutorial aims to serve as an example to that statement by walking you through a singular path as it dives into a simple question. I hope while going through this random walk you will also see the forest and recognize other potential directions that may take you to more relevant places for your own data analyses since computational strategies covered here will be most useful to life scientists who wish to have full control over their data to conduct meticulous and hands-on analyses.

If you wish to conduct similar analyses to those you will see in this document, all you need is a genome of your favorite organism, and some metagenomes in which you expect populations that match to your organism to be relatively abundant. If any of the following sections confuse you, please feel free to leave a comment down below or get in touch with Blair Paul or Meren to bounce ideas.

Most of the bioinformatics on this page will use anvi’o, a community-led, integrated, reproducible software platform for multi-omics. If you have everything you need and interested in using anvi’o, but you are not sure which anvi’o programs and artifacts would help you best conduct your investigation, please reach out to the anvi’o community via Slack:

As I mentioned, one of the major goals of this tutorial is to share the bioinformatics steps that led to the emergence of this figure,

in which,

• Panel A shows a sketch of a microbial genome, indicating the locations of several DGRs and target genes as well as a retron (purple triangles),
• Panel B visualizes the results of a metagenomic read recruitment analysis, and in particular the coverage of a particular genomic region that contains some genes of interest,
• Panel C details the drop in coverage and increase in the density of single-nucleotide variants (SNVs) in Target Gene 1,
• Panel D visualizes single-amino acid variants (SAAVs) in the same region, and finally,
• Panel E describes an oligotyping analysis of the metagenomic short reads that are recovered from the raw FASTQ files using an ‘in silico primer’, to discuss the ecological signal in this hypervariable region.

If you are here, it is more likely than not that you can name a genome of your interest and a resource for metagenomic data to investigate it. But I will assume that you may not be as familiar with the general strategies or concepts mentioned in the list above (i.e., metagenomics, SNVs, SAAVs, etc.), and I will try to detail them throughout. Please feel free to send an email if you think something deserves to be detailed further.

To demonstrate ways to study hypervariable genomic regions through metagenomics, Blair and I decided to use Trichodesmium erythraeum IMS101 as our genome and one of its diveristy-generating retroelements, and some marine metagenomes made available by the Tara Oceans Project to make sense of environmental populations of Trichodesmium erythraeum.

Diversity-generating retroelements (DGRs) was a good choice to study hypervariability, since they represent one of the most fascinating genetic mechanisms through which bacteria and archaea can diversity their proteins in astonishing speeds. Trichodesmium erythraeum was a good choice, too, thanks to the literature on this organism and other cyanobacteria by Ulrike Pfreundt et al (2014) as well as Alec Vallota-Eastman et al (2020). But as you can imagine, everything here will work as effectively if you were to use a Bacteroides fragilis genomes and some human gut metagenomes.

If you are interested in reproducing this tutorial, please consider installing anvi’o first.

First things first: the genome. We downloaded a genome for the T. erythraeum IMS101 strain through this page on NCBI. Just for posterity, here is the FASTA file that we ended up working with:

Second, we need metagenomes to chracterize environmental populations of T. erythraeum. So we downloaded the following six metagenomes from the collection of Tara Oceans Project metagenomes:

Tara Sample Station Region Lat Lon Depth (m)
TARA_N000000184 TARA_051 [IO] Indian Ocean -21,4593 54,3041 5,97
TARA_N000000202 TARA_051 [IO] Indian Ocean -21,5043 54,3563 5,97
TARA_N000000203 TARA_051 [IO] Indian Ocean -21,5043 54,3563 5,97
TARA_N000002356 TARA_131 [NPO] North Pacific Ocean 22,7091 -158,0247 5,88
TARA_N000002360 TARA_131 [NPO] North Pacific Ocean 22,7267 -157,9955 5,88
TARA_N000002364 TARA_131 [NPO] North Pacific Ocean 22,7511 -157,9632 5,88

Let’s first address why this particular six when there are many more marine metagenomes by Tara Oceans Project, and not all of them.

We chose this particular set because we initially determined in which Tara Oceans metagenomes T. erythraeum IMS101 had the highest coverage. Hypervariability is the enemy of coverage, and everything we will do downstream will work best if the coverage of a given genome is, on average, more than 50X in a given metagenome. This may be too conservative for some analyses, and one can survive with much less coverage, but as the coverage goes down, our confidence will go down with it.

To keep things simple, I will not go into the details of the initial steps of the analysis to identify which six metagenomes to use, but they are nothing more than doing read recruitment using all metagenomes, and then quickly profiling them, for which using a program like anvi-profile-blitz) can be extremely time-effective. But let’s forget about it for now, and talk about the next step. Finally, while I’m here, I thank Matt Schechter for his help downloading all the Tara Oceans Project metagenomes for the large size fraction!

If you are not familiar with metagenomic read recruitment, let’s start from here: it is the most essential first step for the vast majority of studies that make use of metagenomics. If you would like to familiarize yourself with terms and concepts in metagenomic read recruitment, you can watch this introductory video that will offer you a quick and easy introduction to its fundamentals:

In addition, you can go through this tutorial to have some idea about how key steps of actually performing metagenomic read recruitment looks or feels like.

But if you would like to follow this document a little longer, this is how my work directory looked like this after downloading the genome and metagenomes, which are all the files we need to perform a metagenomic read recruitment analysis.

ls T_erythraeum_IMS101.fa ERR1726621_1.fastq.gz ERR1726621_2.fastq.gz ERR1726854_1.fastq.gz ERR1726854_2.fastq.gz ERR1726640_1.fastq.gz ERR1726924_1.fastq.gz ERR1726640_2.fastq.gz ERR1726924_2.fastq.gz ERR1726697_1.fastq.gz ERR1726697_2.fastq.gz ERR1726823_1.fastq.gz ERR1726823_2.fastq.gz ERR1726921_1.fastq.gz ERR1726921_2.fastq.gz  Careful readers will notice that there are 7 pairs of R1 and R2 files for six metagenomes, rather than 12. This is the case in this particular instance because one of the metagenomes was sequenced multiple times, hence the two extra files. In theory what we will do with our T. erythraeum IMS101 genome and six Tara Oceans Project metagenomes is not any different at all than what is explained in the metagenomics read-recruitment tutorial and the steps covered in it. You may very well perform your own read recruitment analyses for your genomes and metagenomes following that tutorial alone, and get two primary anvi’o data products that we will use for everything downstream: contigs-db and profile-db. So, doing the read recruitment manually is the option one. But if you have hundreds of metagenomes, or if you want to use a high-performance computing system to do the job for you, doing these steps manually does not scale well (even though it certainly is the best to do at least a few manual metagenomic read recruitment analyses just to familiarize yourself with all the steps first). For this reason, and for the sake of reproducibility, our group exclusively uses anvi’o snakemake workflows for read recruitment analyses, which help us automatize all the rudimentary steps to get our anvi’o contigs-db and profile-db, magical files the utility of which are hopefully going to be clearer to you as you go through the rest of this document. So, doing the read recruitment automatically is the option two. If you were to go with the option two, and follow the tutorial for anvi’o snakemake workflows, you would realize that if you have a genome and a bunch of metagenomes to recruit reads from, we would need to use what is called the metagenomics workflow in reference mode (where the reference is the genome you already have, in contrast to the ‘default mode’ where anvi’o would first assemble your metagenomes and use the resulting contigs to do the read recruitment). The same tutorial would also tell you that to run the metagenomics workflow in reference mode, one would need a fasta-txt file and a samples-txt file to describe their data neatly. The fasta-txt for this analysis looked like this, name path T_erythraeum_IMS101 T_erythraeum_IMS101.fa and the samples-txt file looked like this: sample r1 r2 TARA_N000002360 ERR1726621_1.fastq.gz ERR1726621_2.fastq.gz TARA_N000000202 ERR1726854_1.fastq.gz ERR1726854_2.fastq.gz TARA_N000002356 ERR1726640_1.fastq.gz,ERR1726924_1.fastq.gz ERR1726640_2.fastq.gz,ERR1726924_2.fastq.gz TARA_N000000184 ERR1726697_1.fastq.gz ERR1726697_2.fastq.gz TARA_N000002364 ERR1726823_1.fastq.gz ERR1726823_2.fastq.gz TARA_N000000203 ERR1726921_1.fastq.gz ERR1726921_2.fastq.gz Once you have these two TAB-delimited files ready, running the metagenomics workflow is just a single command line using the program anvi-run-workflow. We did it, and made the resulting anvi’o databases publicly available at 10.6084/m9.figshare.17708678. ## Taking an initial look at the read recruitment results Regardless of the path you have followed, if you are here, this document assumes that you have your contigs-db and profile-db files ready. If you do not have them, but still would like to reproduce the rest of the steps in this document, you can download the files that we used for our analysis by typing the following commands in your terminal on your anvi’o installed computer: # download the reproducible data pack curl -L https://figshare.com/ndownloader/files/32412086 -o TRICHO-DGRs.tar.gz # unpack the data pack tar -zxvf TRICHO-DGRs.tar.gz # go into the new directory cd TRICHO-DGRs  The rest of this workflow will assume when you type ls in your work directory, you see the following files, which correspond to our copies of the contigs-db (the file named CONTIGS.db) and profile-db (the file named PROFILE.db): AUXILIARY-DATA.db CONTIGS.db PROFILE.db  The AUXILIARY-DATA.db file is also an anvi’o artifact, but essentially it is automatically generated by anvi-merge as an accessory (albeit a critical one) to the PROFILE.db, so we never mention it even though you also need to have that file in your work directory. If you have these files accessible to you, then you should be able to take a first look at the read recruitment results with anvi-interactive by running the following command, anvi-interactive -p PROFILE.db -c CONTIGS.db  which should open your browser to present you with this display: This is an anvi’o interactive interface, which provides a useful environment to explore read recruitment results. You can right-click on any item on shown on this page, and inspect the coverage of the genome across different metagenomes: Even blindly go through a few sections of the genome to explore how the coverage looks like (which is something not many people does) is a useful exercise and will make you start wondering about how to make sense of weird things you see. Most bioinformatics approaches summarize coverage values per contig into a single number, even though there is so much to learn from actually just looking at them, so please feel free to take a moment to appreciate the sheer complexity of environmental metagenomes. ## Looking for RTs in the genome Exploring coverage patterns can certainly help life scientists who know their organisms to recognize interesting phenomena, but at some point one would like to focus on the task at hand, which, in the context of our study, was to find smoking guns of hypervariability. In the case of DGRs, we know it is a good idea to start with retron-type reverse transcriptase genes. And when we have a contigs-db, searching for genes with particular functions, and highlighting the regions of the genome that contains genes with these functions can offer a good starting point to go deeper: The same can also be done via the command line using the program anvi-search-functions. Not all hits to the term “retron-type reverse transcriptase” will identify a gene that is a true RT, but browsing these small number of regions can help identify the true ones. For instance, we found two regions that contained true RTs, and focused on one of these, which we denoted as DGR2 throughout our review. In the GIF above, it is one of those regions highlighted as red, which corresponds to T_erythraeum_IMS101_000000000001_split_00132 in our anvi’o databases. When there is a particular region of interest, you can find it on the interface by searching for this name to right-click and inspect it, or you can use the program anvi-inspect to visualize it. For instance, once you know which region you are interested in, running the following command in your terminal on our project files, anvi-inspect -p PROFILE.db \ -c CONTIGS.db \ --split-name T_erythraeum_IMS101_000000000001_split_00132  will give you the following display with the RT: The region shown in Panel B of our Figure 3 is around relative nucleotide positions of 8300 and 13500, and if you zoom into that region, You will see the same region shown in Panel B. For trained eyes that have been looking at coverage data for a long time, the evidence for a DGR activity in this region is right there in this visualization. But let’s first take a look at this diagram that marks with tiny red boxes the locations of the template sequence the TR uses as well as the two DGR target sites in two genes: The alignment of sequences show that this is a text-book definition of DGR activity: sequences in both target sites differ from the template region (TR) only at adenosine nucleotides. In fact, it is possible to confirm it visually. First, copy the entire sequence that matches to this region from the interface, to search in a text editor the TR, which is this: GCGGCTCCTGGAACAACTATCCTAGGAGGTGTCGCTCTGCGAACCGCAACAACTATAACTCGGACGAGGCGGACAACAACAATATTGGTTTTCGTCTT  but since we are working with the reverse-complemented genes (see the direction of the RT), we should search for the reverse complement of the template, which is this: AAGACGAAAACCAATATTGTTGTTGTCCGCCTCGTCCGAGTTATAGTTGTTGCGGTTCGCAGAGCGACACCTCCTAGGATAGTTGTTCCAGGAGCCGC  There will be a single match, which will be the TR region, obviously: But if we replace every T (instead of an A since the TR is reverse-complemented) with an N, AAGACGAAAACCAA.A..G..G..G.CCGCC.CG.CCGAG..A.AG..G..GCGG..CGCAGAGCGACACC.CC.AGGA.AG..G..CCAGGAGCCGC  and search for this pattern instead, there will be multiple hits: Poetic! ## Visualizing coverage and single-nucleotide variants (SNVs) So far we have been focusing on the genome, which essentially captures the state of the DGR-mediated evolution of the VRs at the time of sequencing of it. But of course what is even more interesting is to look into what has been happening in the environment. For the sake of simplicity, from this point on we will focus only one of the VR regions: To be specific, this region corresponds to this sequence: >T_erythraeum_IMS101_000000000001_split_00132 range:8350,8827 TCCTGGGGGGGAAACTCACAAGACGAAAACCAACAAAGTCGTTGACCGCCTCGACCGAGAAATAGTAGCCGCGGAACGCAGAGCGACACCACCAAGGAAAGTCGTACCAGGAGCCGCCCC GTAATCTTTTTGTAGTATATTTTGTAGTATTTTGATTACTTGATACCCATGGACTACCATCATTAGGTGCGCCCTCGTAATTATCATGCGAGTCATCTTCGCACCATTCCCAGACATTCC CGTGCATATCATATAATCCAAAAAGATTAGCAACTTTGAAGTAACCAACTGGAGTGGTTTGTTCTCTATATTCTCCTTTTGGGCCTCTGCCGTAAGAGCCAGACGATTTTTTTTCCCCAT CATCCTTTCCACGATAGTTAGCGAGGTCTGTGGTAATAGTTTCCCCAAAATGAAAAGGCGTAGTTGTTCCTGCACGACAAGCATATTCCCATTGTGCTTCACTGGGTAGCTGATAGT  Black bars indicate the positions of nucleotides in the genome that disagrees with the otherwise-homologous sequences found in the environment. As we go into the variable region, the ‘mappability’ of reads start to decrease due to rapid increase in variation. This is an extremely straightforward way to recognize hypervariability that has emerged within the environmental populations. In addition to targeted analysis like this one (i.e., finding the RT, looking at upstream and downstream with expected outcomes), random walks throughout contigs may reveal new regions of potential interest. ## Working with single-amino acid variants (SAAVs) Variation in coverage and SNVs are quite effective indicators of within-population hypervariability in read recruitment results, however, SNVs are not sufficient indicators of change that influence amino acid composition. To address this we developed an algorithm to identify single-amino acid variants, or SAAVs, which we had first described in Delmont and Kiefl et al (2019)). Sadly, while anvi’o displays SNVs in the interactive interface, it does not display SAAVs. But we clearly managed to show SAAVs in our figure instead of SNVs, anyway. How? Anvi’o comes with one of the most powerful frameworks to study microbial population genetics (more info here), and enables extensive characterization of not only SNVs, but also single-codon variants (SCVs), as well as SAAVs. For instance, recovering SNVs, SCVs, or SAAVs of a gene, contig, or an entire genome in one or more metagenomes is possible via the program anvi-gen-variability-profile. Since we know our gene of interest, we can instruct the program to focus only on it after learning the gene caller id of the gene of interest from the interface: After this, we can run anvi-gen-variability-profile the following way, anvi-gen-variability-profile -c CONTIGS.db \ -p PROFILE.db \ --gene-caller-ids 1687 \ --samples-of-interest samples-of-interest.txt \ -o DGR2_1687_AA.txt \ --engine AA \ --kiefl-mode  The samples-of-interest.txt is simply a text file that contains a single line that says TARA_N000000184, since in this figure we only used one of the six metagenomes as an example. which would yield an output file that looks like this: entry_id unique_pos_identifier sample_id corresponding_gene_call codon_order_in_gene codon_number gene_length coverage Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro STP Ser Thr Trp Tyr Val reference consensus competing_aas departure_from_reference departure_from_consensus n2n1ratio entropy BLOSUM90 BLOSUM90_weighted BLOSUM62 BLOSUM62_weighted (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) 272 272 TARA_N000000184 1687 246 247 879 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 Lys Lys LysLys 0.0 0.0 0.0 0.0 6.0 5.0 48 48 TARA_N000000184 1687 247 248 879 139 2 0 0 0 0 0 0 0 0 16 0 0 0 0 0 0 0 0 0 0 121 Arg Val IleVal 1.0 0.12949640287769784 0.1322314049586777 0.43060076813672 3.0 2.489592760180997 3.0 2.6135746606334855 49 49 TARA_N000000184 1687 248 249 879 133 0 0 0 0 0 0 0 0 0 0 121 0 0 0 0 0 5 0 0 0 7 Leu Leu LeuVal 0.09022556390977443 0.09022556390977443 0.05785123966942149 0.3643399806821796 0.0 -1.2676529926025555 1.0 -0.29119031607262946 273 273 TARA_N000000184 1687 249 250 879 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Arg Arg ArgArg 0.0 0.0 0.0 0.0 6.0 5.0 50 50 TARA_N000000184 1687 250 251 879 108 0 0 0 0 0 0 0 105 0 0 0 0 0 0 0 1 2 0 0 0 0 Gly Gly GlySer 0.02777777777777779 0.027777777777777776 0.01904761904761905 0.1446114944607026 -1.0 -1.0000000000000013 0.0 0.0 274 274 TARA_N000000184 1687 251 252 879 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 Gly Gly GlyGly 0.0 0.0 0.0 0.0 6.0 6.0 275 275 TARA_N000000184 1687 252 253 879 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 Ser Ser SerSer 0.0 0.0 0.0 0.0 5.0 4.0 276 276 TARA_N000000184 1687 253 254 879 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 Trp Trp TrpTrp 0.0 0.0 0.0 0.0 11.0 11.0 51 51 TARA_N000000184 1687 254 255 879 67 0 0 5 4 1 0 0 0 0 1 5 0 0 22 0 0 1 0 0 28 0 Tyr Tyr PheTyr 0.582089552238806 0.582089552238806 0.7857142857142857 1.4741819204714253 3.0 -0.5564720812182741 3.0 0.048857868020304465 52 52 TARA_N000000184 1687 255 256 879 63 0 0 12 8 0 0 0 0 0 0 1 0 0 16 0 0 2 0 0 17 7 Asp Tyr PheTyr 0.873015873015873 0.7301587301587301 0.9411764705882353 1.6988787908328986 3.0 -2.0360531309297913 3.0 -1.0961416824794434 53 53 TARA_N000000184 1687 256 257 879 68 0 0 5 1 3 0 0 2 0 6 4 0 0 19 0 0 2 0 0 25 1 Phe Tyr PheTyr 0.7205882352941176 0.6323529411764706 0.76 1.7661608575859613 3.0 -1.1439864483342745 3.0 -0.3387916431394694 277 277 TARA_N000000184 1687 257 258 879 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 Pro Pro ProPro 0.0 0.0 0.0 0.0 8.0 7.0 54 54 TARA_N000000184 1687 258 259 879 77 0 36 0 0 0 0 0 14 0 0 0 0 0 0 0 0 0 0 27 0 0 Trp Arg ArgTrp 0.6493506493506493 0.5324675324675324 0.75 1.0328823078127687 -4.0 -3.7281553398058254 -3.0 -2.5242718446601944 55 55 TARA_N000000184 1687 259 260 879 84 0 18 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 63 0 0 Trp Trp ArgTrp 0.25 0.25 0.2857142857142857 0.6648642241909106 -4.0 -3.96078431372549 -3.0 -2.823529411764706 278 278 TARA_N000000184 1687 260 261 879 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Cys Cys CysCys 0.0 0.0 0.0 0.0 9.0 9.0 279 279 TARA_N000000184 1687 261 262 879 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Arg Arg ArgArg 0.0 0.0 0.0 0.0 6.0 5.0 280 280 TARA_N000000184 1687 262 263 879 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 Ser Ser SerSer 0.0 0.0 0.0 0.0 5.0 4.0 281 281 TARA_N000000184 1687 263 264 879 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Ala Ala AlaAla 0.0 0.0 0.0 0.0 5.0 4.0 56 56 TARA_N000000184 1687 264 265 879 149 0 0 13 14 1 0 0 1 0 13 18 0 0 39 0 0 3 0 0 42 5 Phe Tyr PheTyr 0.738255033557047 0.7181208053691275 0.9285714285714286 1.870615888859036 3.0 -1.5474420153146153 3.0 -0.7085784041726777 282 282 TARA_N000000184 1687 265 266 879 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Arg Arg ArgArg 0.0 0.0 0.0 0.0 6.0 5.0 57 57 TARA_N000000184 1687 266 267 879 183 0 2 27 39 0 0 0 16 1 6 22 0 0 21 0 0 20 0 0 23 6 Gly Asp AsnAsp 0.912568306010929 0.7868852459016393 0.6923076923076923 2.1325230668720696 1.0 -2.4990362109321214 1.0 -1.497865895635412 58 58 TARA_N000000184 1687 267 268 879 231 0 7 58 11 2 0 0 5 0 15 20 0 0 28 0 0 11 0 0 58 16 Tyr Asn AsnTyr 0.748917748917749 0.7489177489177489 1.0 2.044069292686376 -3.0 -2.2779967747715464 -2.0 -1.3312130442573014 59 59 TARA_N000000184 1687 268 269 879 253 0 0 0 1 10 0 0 0 0 0 0 0 0 29 0 0 18 0 1 194 0 Tyr Tyr PheTyr 0.23320158102766797 0.233201581027668 0.14948453608247422 0.8113791942612562 3.0 -0.38819405719748273 3.0 0.3066995937226163 60 60 TARA_N000000184 1687 269 270 879 287 2 0 11 4 1 0 0 1 2 26 46 0 0 84 0 0 7 0 0 79 24 Phe Phe PheTyr 0.7073170731707317 0.7073170731707317 0.9404761904761905 1.8170003941079511 3.0 -0.7784392745924161 3.0 -0.04326189167735232 61 61 TARA_N000000184 1687 270 271 879 326 0 0 0 0 0 0 0 0 0 0 4 0 0 3 1 0 318 0 0 0 0 Ser Ser LeuSer 0.024539877300613466 0.024539877300613498 0.012578616352201259 0.1391263567207563 -3.0 -2.864611783066717 -2.0 -1.870464299648848 62 62 TARA_N000000184 1687 271 272 879 372 5 0 2 57 0 0 0 16 0 2 4 0 0 3 0 0 0 0 0 0 283 Val Val AspVal 0.239247311827957 0.239247311827957 0.20141342756183744 0.8325024390840705 -5.0 -4.1310750566396255 -3.0 -2.392567419425565 63 63 TARA_N000000184 1687 272 273 879 416 1 0 0 1 0 1 368 9 1 0 0 2 1 0 0 0 0 0 0 0 32 Glu Glu GluVal 0.11538461538461542 0.11538461538461539 0.08695652173913043 0.4868405831148602 -3.0 -2.640725121857714 -2.0 -1.6656991072895562 64 64 TARA_N000000184 1687 273 274 879 444 163 0 0 1 0 0 1 1 0 0 3 1 11 0 2 0 67 186 1 0 7 Ala Thr AlaThr 0.6328828828828829 0.581081081081081 0.8763440860215054 1.3015342024938332 0.0 0.06597407345347082 0.0 0.19405304126616596 65 65 TARA_N000000184 1687 274 275 879 464 9 0 6 178 0 0 3 46 0 0 2 0 0 11 0 0 0 3 0 1 205 Val Val AspVal 0.5581896551724138 0.5581896551724138 0.8682926829268293 1.2808922763228758 -5.0 -4.0563896271819555 -3.0 -2.380892177286114 66 66 TARA_N000000184 1687 275 276 879 513 0 8 167 88 0 0 2 10 8 41 22 1 0 26 0 0 47 10 0 69 14 Asn Asn AsnAsp 0.6744639376218324 0.6744639376218323 0.5269461077844312 2.0600679413792027 1.0 -2.0710786570655073 1.0 -1.1917346544135248 67 67 TARA_N000000184 1687 276 277 879 546 1 0 157 174 0 0 0 13 2 13 23 0 0 20 2 0 39 2 1 73 26 Asp Asp AsnAsp 0.6813186813186813 0.6813186813186813 0.9022988505747126 1.8425864744458764 1.0 -2.028227182742849 1.0 -1.1755409925592188 68 68 TARA_N000000184 1687 277 278 879 591 12 10 81 66 4 0 0 9 3 65 61 0 0 113 3 0 31 7 0 87 39 Phe Phe PheTyr 0.8087986463620982 0.8087986463620981 0.7699115044247787 2.278592016934293 3.0 -2.1815553891728245 3.0 -1.244265763859501 69 69 TARA_N000000184 1687 278 279 879 671 0 0 0 0 0 0 0 0 0 445 17 0 14 137 0 0 0 0 0 0 58 Val Ile IlePhe 0.9135618479880775 0.33681073025335323 0.30786516853932583 0.9822451928916645 -1.0 0.11259679962369017 0.0 0.8111394698560093 283 283 TARA_N000000184 1687 279 280 879 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 Gly Gly GlyGly 0.0 0.0 0.0 0.0 6.0 6.0 284 284 TARA_N000000184 1687 280 281 879 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 Phe Phe PhePhe 0.0 0.0 0.0 0.0 7.0 6.0 285 285 TARA_N000000184 1687 281 282 879 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Arg Arg ArgArg 0.0 0.0 0.0 0.0 6.0 5.0 286 286 TARA_N000000184 1687 282 283 879 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 Leu Leu LeuLeu 0.0 0.0 0.0 0.0 5.0 4.0 287 287 TARA_N000000184 1687 283 284 879 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 Val Val ValVal 0.0 0.0 0.0 0.0 5.0 4.0 288 288 TARA_N000000184 1687 284 285 879 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 Ser Ser SerSer 0.0 0.0 0.0 0.0 5.0 4.0 70 70 TARA_N000000184 1687 285 286 879 902 0 0 0 0 0 0 0 0 0 3 0 0 0 797 0 0 2 0 0 0 100 Phe Phe PheVal 0.11640798226164084 0.1164079822616408 0.12547051442910917 0.38572280296535677 -2.0 -1.972788065232625 -1.0 -0.9787269423097479 289 289 TARA_N000000184 1687 286 287 879 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 Pro Pro ProPro 0.0 0.0 0.0 0.0 8.0 7.0 290 290 TARA_N000000184 1687 287 288 879 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 Pro Pro ProPro 0.0 0.0 0.0 0.0 8.0 7.0 291 291 TARA_N000000184 1687 288 289 879 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Arg Arg ArgArg 0.0 0.0 0.0 0.0 6.0 5.0 71 71 TARA_N000000184 1687 289 290 879 923 116 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 803 0 0 0 Thr Thr AlaThr 0.1300108342361863 0.13001083423618634 0.1444582814445828 0.40540770155369993 0.0 -0.07113938692886063 0.0 -0.03796579360489136 72 72 TARA_N000000184 1687 290 291 879 923 0 0 0 0 0 0 0 0 0 0 351 0 0 32 509 0 28 3 0 0 0 Pro Pro LeuPro 0.44853737811484296 0.4485373781148429 0.6895874263261297 0.9371058142414911 -4.0 -3.6159630635059727 -3.0 -2.7339717418720464 73 73 TARA_N000000184 1687 291 292 879 915 0 0 0 1 0 31 860 0 0 0 0 20 0 0 0 3 0 0 0 0 0 Glu Glu GlnGlu 0.060109289617486295 0.060109289617486336 0.03604651162790698 0.28272085169189326 2.0 1.2061642175761718 2.0 1.6047234033178384 292 292 TARA_N000000184 1687 292 293 879 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 STP STP STPSTP 0.0 0.0 0.0 0.0 This file (which scrolls to the right quite a bit), shows for each codon position in this gene, the variability of environmental sequences. The power of this approach comes from the fact that SAAVs represent the allele frequency of amino acids in a single codon position based on the number of short reads that fully cover the codon (which is described here). That’s why the matrix in this output file where the frequencies of individual amino acids are shown is full of zeros for most positions that are outside the VR, yet there clearly is a tremendous amount of amino acid diversity in the environment within the VR site! Visualizing this information is indeed a challenge, but thanks to the comprehensive output file anvi’o generates, it is not impossible with a few lines of R code. Here is what I wrote for this: #!/usr/bin/env Rscript library(ggplot2) library(reshape2) library(reshape) require(gridExtra) # a color palette to use for amino acids later. it is slightly # modified by Kevin Wright's beautiful template here: # https://stackoverflow.com/a/9568659 c25 <- c("dodgerblue2", "#E31A1C", "green4", "#6A3D9A", "#FF7F00", "yellow4", "gold1", "skyblue2", "#FB9A99", "palegreen2", "#CAB2D6", "#FDBF6F", "gray70", "brown", "maroon", "orchid1", "deeppink1", "blue1", "steelblue4", "darkturquoise", "darkorange4", "white") # boring theme thingies F <- theme(panel.border = element_blank(), plot.margin = unit(c(0.6,0.3,1,0.3), "cm"), axis.title.y=element_blank(), axis.title.x=element_blank(), axis.line.x = element_blank(), axis.text.x = element_text(face = "bold"), axis.text.y = element_text(face = "bold"), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.line.y = element_blank(), axis.ticks.x = element_blank(), axis.ticks.y = element_blank(), plot.title = element_text(color = "#222222", size = 14, face = "bold"), plot.subtitle = element_text(color = "#444444", size = 10)) # read in the data, which is the output of anvi-gen-variabiliyt-profile df <- read.table(file='DGR2_1687_AA.txt', header = TRUE, sep = "\t") # focus on on the codons that are closer to the VR df <- df[dfcodon_order_in_gene > 125, ]

# get dx from df, a simplified data frame that only contains codon order
# and amino acid frequencies, so we can melt it into a sweet, long-format
# dataframe that is easier to visualize with ggplot:
columns_to_keep <- c("codon_order_in_gene", "Ala", "Arg", "Asn", "Asp", "Cys",
"Gln", "Glu", "Gly", "His", "Ile", "Leu", "Lys", "Met",
"Phe", "Pro", "STP", "Ser", "Thr", "Trp", "Tyr", "Val")
dx <- melt(df[, (names(df) %in% columns_to_keep)], id="codon_order_in_gene")
names(dx) <- c("codon", "aa", "freq")
dx$aa <- as.character(dx$aa)

# we will learn the dominant amino acid per codon position, and name it
# ZREF so we can set an ugly color to only focus on variable sites
for (codon_num in unique(dx$codon)){ dxm <- dx[dx$codon == codon_num, ]
dominant_aa <- as.character(tail(dxm[order(dxm$freq), ], n=1)$aa)
dx[dx$codon == codon_num & dx$aa == dominant_aa, "aa"] <- 'ZREF'
}

# generate the plot object
p <- ggplot(dx, aes(x=codon, y=freq, fill=aa))
p <- p + geom_bar(position="fill", stat="identity")
p <- p + theme(legend.position="bottom")
p <- p + scale_fill_manual(values = c25)
p <- p + scale_y_reverse() + scale_x_reverse()
p <- p + ggtitle('Single amino acid variants')
p <- p + F

# print it into a PDF file!
pdf(file = "SAAVs.pdf",  width = 15, height = 5)
print(p)
dev.off()


Running this code generates this image in a PDF file,

which was straightforward to overlay on the coverage data to generate the following in our beloved Inkscape:

Some SNVs do not have a corresponding SAAV, since not every single-nucleotide variant is non-synonymous.

We are about to move on to an entirely different way to look at these data. But I would like to point out here as a reminder that the output file generated by the program anvi-gen-variability-profile can serve many purposes for much more in-depth analyses of evolution and microbial population genetics.

## Defining an in silico primer and collecting reads

So far we investigated the coverage patterns in a region of a genome that is influenced by the activities of a retroelement, visualized single-nucleotide variants in that region, and confirmed the extensive amino acid diversification of the environmental populations of Trichodesmium erythraeum at this site. For all these investigations, we used a single metagenome for simplicity. But for those who are interested in microdiversity or hypervariability, the question sooner or later comes to those that are of ecology as we inevitably wonder whether these patterns conserved across environments or there is any evidence that they respond to environmental change.

In other words, can you squint your eyes looking at the figure below that shows all the SNVs in that most variable region of the DGR2 across all metagenomes, and tell if some metagenomes are more similar to one another than others** or if each metagenome seems to have a random collection of variants as if sequences were evolving on a Brownian motion path? And if there are groups of metagenomes in which DGR activity seems to create similar diversity patterns, can you tell whether biogeography explains the similarities and differences? Of course these are worthy questions to go after as our answers to them may have significant implications on our ability to ask better questions to understand the role of DGRs (or any other genetic mechanism that tinker with the speed of evolution) on the lifestyles of environmental microbes.

We will get back to whether you can squint your eyes or not to answer those important questions. But there is something that is worth noting here before moving on. The green bars for SNVs in the anvi’o inspection page mark SNVs that are at the third nucleotide position of a given codon, while red ones indicate those that occur at the first or the second nucleotide positions in codons. Changes in the first and second nucleotide positions in a codon are much more likely to influence the amino acid identity than those that occur in the third (since only 30% of changes at the third codon positions will result in a different amino acid). When you look at metagenomes and SNVs long enough, you realize that if you use a genome to recruit reads from a metagenome that contains very closely related populations. genomes of which are almost identical to your reference genome but not quite, most differences between them that yield SNVs occur in the third nucleotide positions of codons. Because, you know, evolution. And there is something magical in being able to see something so fundamental to DGRs even in this little figure above, as a software platform that is absolutely indifferent to DGRs is simply showing us how almost all variation in this region is in the first and second nucleotide positions of codons.

And going back to the topic, no, even though as a computer scientist I very much believe in the power of squinting eyes to understand biology, in some cases we just can’t use that strategy to find answers to some questions. In fact, when variability is so high, we can’t even trust reads that we were able to recruit: the sudden drop in coverage tells us clearly that there are many more reads in the environment that would have matched to this very region, but our mapping software could not place them here. Due to extreme variation, they were no longer have the minimum sequence identity to this reference to be recruited. Working with mapped reads is the field standard. But we clearly need to go back to the raw reads and recover those we are missing.

Luckily, the anvi’o program anvi-script-get-primer-matches does exactly that for us. It takes in a primers-txt, in which we describe a pattern we are interested in, and a samples-txt file, to find all the short reads that match to those patterns and report a FASTA file for each one of them. We already have our samples-txt that looks like this:

sample r1 r2
TARA_N000002360 ERR1726621_1.fastq.gz ERR1726621_2.fastq.gz
TARA_N000000202 ERR1726854_1.fastq.gz ERR1726854_2.fastq.gz
TARA_N000002356 ERR1726640_1.fastq.gz,ERR1726924_1.fastq.gz ERR1726640_2.fastq.gz,ERR1726924_2.fastq.gz
TARA_N000000184 ERR1726697_1.fastq.gz ERR1726697_2.fastq.gz
TARA_N000002364 ERR1726823_1.fastq.gz ERR1726823_2.fastq.gz
TARA_N000000203 ERR1726921_1.fastq.gz ERR1726921_2.fastq.gz

The key part is to define the pattern that would bring us all the short reads that extend into the variable region. For this, we can combine what we learn from the genome and what we learn the environment through SNVs.

This is how our genome sequence looks like in that region (we know that since we zoom in enough, anvi’o shows us the nucleotides matching to each position):

Looking at this, one can design a primer sequence that demands all the nucleotides that is conserved to be present while embracing all the variation for the remaining nucleotide positions. The key is to make sure it is long enough to cover all the conserved regions, but not too long so its total length is not longer than the read length of the metagenomic sequences.

Here is one that takes all into consideration to recover as many metagenomic reads from FASTQ files with as much stringency as possible to understand the variation in this region by cutting off the middle-man, the enemy of hypervariability, the mapping software:

The program anvi-script-get-primer-matches uses ‘regular expressions’ (here is a cheat sheet) where . is equivalent to an N in primer design and means ‘any character is OK here’.

Another way to do it would have been to use the TR sequence and replace every T (not an A since it is reverse complemented) with a ., but doing it based on the environment is a more stringent way to do that. Although, it would certainly be a decision that should be made per question basis.

Now, we know our ‘primer’ sequence, and thus we can generate the primers-txt file:

name primer_sequence
DGR2_VR1 AAGACGAAAACCAA.A..G..G..G..C..C..G..C..G..A..G..G..G

Now running this will get all the matching reads from each metagenome:

anvi-script-get-primer-matches --primers-txt primers.txt \
--samples samples.txt \
--only-report-primer-match \
-o PRIMER-MATCHES


As explained in the program help menu, the flag --only-report-primer-match will only report the parts of sequences that match to the primer and will trim the rest.

When the program is done running, the output directory PRIMER-MATCHES will be populated by as many FASTA files as there are metagenomes x primers (which is 1 in our case), and will look like this:

ls PRIMER-MATCHES/

TARA_N000000184-DGR2_VR1-HITS.fa
TARA_N000000202-DGR2_VR1-HITS.fa
TARA_N000000203-DGR2_VR1-HITS.fa
TARA_N000002356-DGR2_VR1-HITS.fa
TARA_N000002360-DGR2_VR1-HITS.fa
TARA_N000002364-DGR2_VR1-HITS.fa


And each FASTA files will contain reads that are of equal length, and matching 100% to the primer we defined:

>TARA_N000002364_DGR2_VR1_00001
AAGACGAAAACCAATAACGCTGAAGTCCGCCTCGACCGAGTAAGAGTAGTAG
>TARA_N000002364_DGR2_VR1_00002
AAGACGAAAACCAATATCGTTGACGTCCGTCTCGACCGAGTAATAGAAGTTG
>TARA_N000002364_DGR2_VR1_00003
AAGACGAAAACCAATATTGTTGTTGTCCGTCTCGTCCGAGTTATAGTTGTTG
>TARA_N000002364_DGR2_VR1_00004
AAGACGAAAACCAAAATTGTTGTCGTCCGTCACGACCGAGTAATAGTTGAGG
>TARA_N000002364_DGR2_VR1_00005
AAGACGAAAACCAATATTGTTGTTGTCCGTCTCGTCCGAGTTATAGTTGTTG
>TARA_N000002364_DGR2_VR1_00006
AAGACGAAAACCAATATTGCCGCTGACCGCCTCGCCCGAGTAATAGAAGCAG
>TARA_N000002364_DGR2_VR1_00007
AAGACGAAAACCAATATTGTTGTTGTCCGTCTCGTCCGAGTTATAGTTGTTG
>TARA_N000002364_DGR2_VR1_00008
AAGACGAAAACCAACATGGCTGTAGTCCGACTCGCCCGAGTAATAGTCGAGG
>TARA_N000002364_DGR2_VR1_00009
AAGACGAAAACCAATAACGTCGCTGACCGTCTCGTCCGAGAAAGAGTAGTCG
>TARA_N000002364_DGR2_VR1_00010
AAGACGAAAACCAATAAAGTCGTAGTCCGACTCGACCGAGACATAGTAGTGG
(...)


Finally, one can put all the short reads into a single file so we can think about how to make sense of them:

cat PRIMER-MATCHES/*-DGR2_VR1-HITS.fa > DGR2-VR1-HITS.fa


## Performing an oligotyping analysis

The FASTA file DGR2-VR1-HITS.fa contains all the sequences we need to think about the ecology of the DGR activity. Opening it in any program to study sequence alignments will already yield a lot of insights, but if our desire is to compare distinct environments based on their DGR profiles, we somehow need to divide sequences in this file into meaningful groups. At this stage using taxonomy for this is obviously out of question, since we are studying a single taxon. Another option could be using de novo sequence similarities, a problem you may be familiar with from amplicon sequencing, where 16S rRNA gene amplicons must be first turned into operational units based on arbitrary sequence similarity thresholds and compared across samples. We could do that, but the extremely high diversity within these sequences would either yield way too many OTUs with high similarity thresholds or too little OTUs with too much diversity within with low similarity thresholds. We could also use denoising algorithms such as DADA2, but our problem requires much more subtle access to diversity, rather than trying to tame it.

By focusing only on a user-defined number of high-entropy nucleotide positions to decompose a group of sequences, oligotyping offers a very good and scalable solution for this problem with room for extensive supervision. Even though it was designed for 16S rRNA gene amplicons, it has quite useful applications to metagenomics to study microdiversity and/or hypervariability.

Anvi’o already includes programs such as anvi-report-linkmers and anvi-oligotype-linkmers to resolve oligotypes through physically connected nucleotide positions, but since we already have our sequences of interest in a FASTA file, here we can use the stand-alone oligotyping pipeline (which is easy to install and run) to also get summary figures without any additional effort.

As per the instructions on how to run an oligotyping analysis on a given set of sequences, the first step is to calculate the entropy in the file, which we can do the following way on our FASTA file:

entropy-analysis DGR2-VR1-HITS.fa --quick


At the end of this analysis, oligotyping greets us with a figure that shows the entropy results:

Which are exactly as we would have expected given our primer. So far so good. The next step is to run the oligotyping analysis:

oligotype DGR2-VR1-HITS.fa \
DGR2-VR1-HITS.fa-ENTROPY \
--number-of-auto-components 10 \
--min-substantive-abundance 8 \
--output-directory DGR2-VR1-OLIGOTYPING


The parameter --number-of-auto-components tells oligotyping to use the top 10 highest entropy positions to use for decomposing sequences, while --min-substantive-abundance asks for the removal of any oligotype in which the most frequent sequence occurs less than 8 times in the entire dataset to reduce the number of units reported. I did test a range of these parameters, and none influenced the final outcome.

The output message from the oligotyping pipeline tells us to look at the results as a web page at the very end:

Project ..........................................................: DGR2-VR
Run date .........................................................: 01 Jan 22 14:27:15
Library version ..................................................: 3.1-dev

(...)

Sample/oligotype abundance data matrix (counts) ..................: DGR2-VR1-OLIGOTYPING/MATRIX-COUNT.txt
Sample/oligotype abundance data matrix (percents) ................: DGR2-VR1-OLIGOTYPING/MATRIX-PERCENT.txt
GEXF file for network analysis ...................................: DGR2-VR1-OLIGOTYPING/NETWORK.gexf
Total Purity Score ...............................................: 0.87
End of run .......................................................: 01 Jan 22 14:28:02

View results in your browser: "DGR2-VR1-OLIGOTYPING/HTML-OUTPUT/index.html"



The outputs items displayed on this static web page include the following bar plot, which shows each oligotype,

and this dendrogram, that clusters our metagenomes based on DGR-mediated sequence diversity at the variable site of the target gene one of DGR2 in the Trichodesmium erythraeum populations:

Finally, to show where these Tara Oceans Project stations are on the map, we used the library world-map-r. We first downloaded the codebase:

# get the codebase
git clone https://github.com/merenlab/world-map-r.git

# go into it:
cd world-map-r


Then replaced the contents of the data.txt with this:

samples Lat Lon MAG_MAP
TARA_N000000184 -21.4593 54.3041 1
TARA_N000000202 -21.5043 54.3563 1
TARA_N000000203 -21.5043 54.3563 1
TARA_N000002356 22.7091 -158.0247 1
TARA_N000002360 22.7267 -157.9955 1
TARA_N000002364 22.7511 -157.9632 1

and run,

./generate-PDFs.R


which gave us this ugly PDF, the final piece for the Panel E of our Figure:

Then we beautified it a bit and put everything together in Inkscape:

Thank you for reading this far, and I hope this article achieved its goal to demonstrate a high-resolution approach to study within-population sequence diversity in the context of short genomic regions using metagenomics.

Please feel free to reach out to us if you are having hard time following it, or you need guidance to conduct similar analyses on your own data.