A Wolbachia plasmid from Culex pipiens ovaries
Table of Contents
- Setting the stage
- Downloading individual C. pipiens ovary metagenomes
- Running the anvi’o metagenomic workflow
- Taking look at the output files
- Reconstructing Wolbachia population genomes
- A putative Wolbachia plasmid: pWCP
- Investigating the circular and extrachromosomal nature of pWCP with long reads
- Read recruitment experiments with pWCP
- Computing the Wolbachia metapangenome
- Downloading wPip reference genome
- Computing the initial pangeome
- Identifying genes that correspond to WO prophages
- Recovering coverage values for gene clusters of the Wolbachia pangenome in C. pipiens metagenomes
- Importing additional data into the pangenome and finalizing it
- Computing the average nucleotide identity between Wolbachia genomes
- Computing the density of single-nucleotide variants for Wolbachia populations
This document describes the reproducible bioinformatics workflow for our study titled “The Wolbachia mobilome in Culex pipiens includes a putative plasmid”.
The doi:10.1038/s41467-019-08973-w serves the open-access peer-reviewed study described in this document.
Here you will find program names and exact parameters we used throughout every step of the analysis of metagenomes we generated from the ovaries of four C. pipiens individual mosquitoes. Our computational investigations predominantly relied on the open-source analysis platform anvi’o, and utilized the following strategis in general:
-
Reconstructing high-quality microbial genomes from metagenomes (i.e., metagenome-assembled genomes, or MAGs),
-
Characterizing microdiversity within environmental populations our MAGs represented through single-nucleotide variants,
-
Using metapangenomics to investigate gene content of metagenome-assembled and reference genomes and its association with metagenomes,
-
Applying a long-read sequencing and analysis strategy to very low-biomass samples for hypothesis testing,
-
Employing anvi’o snakemake workflows to study complex datasets with many steps of analyses in a reproducible manner.
All anvi’o analyses in this document are performed using the anvi’o version v5
. Please see the installation notes to download the appropriate version through PyPI, Docker, or GitHub.
Summary
The purpose of this doucument is to focus on the methodological aspects of our workflow, hence, we refer you to the peer-reviewed study for more extensive insights into its scientific context. Nonetheless, here is a brief summary of our study:
-
We captured four wild mosquitoes in Southern France, and performed microsurgery to extract their ovaries (ouch),
-
After shotgun sequencing, we reconstructed de novo population genomes for Wolbachia, the endosymbiotic alphaproteobacterial genus of many arthropod and nematode species worldwide,
-
Identified a circular, extra-chromosomal element with hallmarks of a plasmid in the mobilome, and confirmed its occurrence and circular nature using PCR, Sanger sequencing, and long-read sequencing experiments,
-
Used publicly available mosquito metagenomes to show the occurrence and conservancy of the plasmid, and the diversity and richness of phages in individual mosquitoes.
Sections in this document will detail all the steps of downloading and processing the raw metagenomes, mapping metagenomic reads onto the reference genomes as well as metagenome-assembled genomes.
Please feel free to leave a comment, or send an e-mail to us if you have any questions.
Setting the stage
This section explains how to download and process the metagenomes we have generated from the for the C. pipiens ovary samples used in this study.
We made available the metagenomes through the European Bioinformatics Institute (EBI) repository and NCBI under the project accession number PRJEB26028
.
Downloading individual C. pipiens ovary metagenomes
The raw Illumina paired-end seqeuncing files for shotgun metagenomes are stored in European Nucleotide Archive under the accession number PRJEB26028, and the following table lists the individual accession numbers for each sample:
Accession Number on ENA | Sample Name In our Study |
---|---|
ERS2407346 | Culex pipiens O03 |
ERS2407347 | Culex pipiens O07 |
ERS2407348 | Culex pipiens O11 |
ERS2407349 | Culex pipiens O12 |
If you wish to follow this workflow, you can download raw Illumina paired-end seqeuncing data file for 4 ovary samples into your work directory the following way:
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR252/000/ERR2523110/ERR2523110_1.fastq.gz \
-O Culex_O03_RAW_R1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR252/000/ERR2523110/ERR2523110_2.fastq.gz \
-O Culex_O03_RAW_R2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR252/001/ERR2523111/ERR2523111_1.fastq.gz \
-O Culex_O07_RAW_R1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR252/001/ERR2523111/ERR2523111_2.fastq.gz \
-O Culex_O07_RAW_R2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR252/002/ERR2523112/ERR2523112_1.fastq.gz \
-O Culex_O11_RAW_R1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR252/002/ERR2523112/ERR2523112_2.fastq.gz \
-O Culex_O11_RAW_R2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR252/003/ERR2523113/ERR2523113_1.fastq.gz \
-O Culex_O12_RAW_R1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR252/003/ERR2523113/ERR2523113_2.fastq.gz \
-O Culex_O12_RAW_R2.fastq.gz
Once your downloads are finished, you must have the following files in your work directory for each of the four sample:
- Culex_O03_RAW_R1.fastq.gz, Culex_O03_RAW_R2.fastq.gz
- Culex_O07_RAW_R1.fastq.gz, Culex_O07_RAW_R2.fastq.gz
- Culex_O11_RAW_R1.fastq.gz, Culex_O11_RAW_R2.fastq.gz
- Culex_O12_RAW_R1.fastq.gz, Culex_O12_RAW_R2.fastq.gz
Running the anvi’o metagenomic workflow
While we extensively listed our command lines and parameters in our previous reproducible workflows (an example, and another one), anvi’o v5
comes with a snakemake-based workflow management program, which makes our workflows significantly simpler (here is an extensive tutorial that describes its functions). Starting from raw metagenomic reads, this workflow will run everything necessary.
One of the key input files to start the run is the samples.txt
, and you can downlaod the version used in our study into your work directory this way:
wget http://merenlab.org/data/wolbachia-plasmid/files/samples.txt
Here is a quick look at its contents:
$ column -t samples.txt
sample r1 r2
Culex_O03 Culex_O03_RAW_R1.fastq.gz Culex_O03_RAW_R2.fastq.gz
Culex_O07 Culex_O07_RAW_R1.fastq.gz Culex_O07_RAW_R2.fastq.gz
Culex_O11 Culex_O11_RAW_R1.fastq.gz Culex_O11_RAW_R2.fastq.gz
Culex_O12 Culex_O12_RAW_R1.fastq.gz Culex_O12_RAW_R2.fastq.gz
The other key input file is the config.json
, which dictates exactly how the input desribde in samples.txt
should be processed. We used the following command to generate a ‘default’ config file for the metagenomics workflow,
anvi-run-workflow -w metagenomics \
--get-default-config config.json
And edited it slighlty before running the workflow. You can download our config.json
file into your work directory:
wget http://merenlab.org/data/wolbachia-plasmid/files/config.json
The content of which should look like this:
{
"samples_txt": "samples.txt",
"all_against_all": true,
"anvi_gen_contigs_database": {
"--project-name": "{group}",
"threads": 5
},
"centrifuge": {
"run": "false"
},
"anvi_run_hmms": {
"run": true,
"threads": 20
},
"anvi_run_ncbi_cogs": {
"run": true,
"threads": 5
},
"anvi_script_reformat_fasta": {
"run": true,
"--simplify-names": true
},
"idba_ud": {
"run": true,
"--min_contig": 1000,
"threads": 11
},
"iu_filter_quality_minoche": {
"run": true,
"--ignore-deflines": true,
"threads": 2
},
"gzip_fastqs": {
"run": true
},
"bowtie_build": {
"threads": 10
},
"bowtie": {
"additional_params": "--no-unal",
"threads": 10
},
"samtools_view": {
"additional_params": "-F 4",
"threads": 4
},
"anvi_init_bam": {
"threads": 4
},
"anvi_profile": {
"threads": 5,
"--sample-name": "{sample}",
"--overwrite-output-destinations": true
},
"anvi_merge": {
"--sample-name": "{group}",
"--overwrite-output-destinations": true,
"--skip-hierarchical-clustering": true,
"--skip-concoct-binning": true,
"threads": 6
}
}
And it basically instructs the workflow manager to,
- Quality filter short reads using illumina-utils,
- Assemble each sample independently using IDBA_UD,
- Recruit reads from all samples using each independent assembly with Bowtie2,
- Profile and merge resulting files using anvi’o.
The parameter all_against_all: true
will result in individual assemblies of each sample, but reads will be recruited from every metagenome to each assembly. Please refer to the anvi’o snakemake tutorial for the details of the available modes of assembly and mapping.
Just to make sure things look alright, we run the following command to generate a visual summary of the workflow:
anvi-run-workflow -w metagenomics \
-c config.json \
--save-workflow-graph
We finally run it the following way:
anvi-run-workflow -w metagenomics \
-c config.json \
--additional-params \
--cluster 'clusterize -log {log} -n {threads}' \
--resources nodes=40 \
--jobs 10 \
--rerun-incomplete
Please note that the --cluster
parameters are specific to our server, and will require you to remove that paramter, or format it depending on your own server setup if you wish to reproduce this workflow.
Taking look at the output files
Successful completion of the anvi’o metagenomic workflow results in various standard output directories, including 00_LOGS
, 01_QC
, 02_FASTA
, 03_CONTIGS
, 04_MAPPING
, 05_ANVIO_PROFILE
, and 06_MERGED
. The following is the brief summary of the contents of these directories:
-
00_LOGS
: Log files for every operation seen in the graph above. -
01_QC
: Quality-filtered short metagenomic reads and final statistics. -
02_FASTA
: IDBA-UD assembly of each sample independently in the form of FASTA files. -
03_CONTIGS
: Anvi’o contigs databases for each assembly with COG annotations and HMM hits for single-copy core genes.
$ ls 03_CONTIGS/*db
03_CONTIGS/Culex_O03-contigs.db 03_CONTIGS/Culex_O07-contigs.db
03_CONTIGS/Culex_O11-contigs.db 03_CONTIGS/Culex_O12-contigs.db
04_MAPPING
: Bowtie2 read recruitment results for assembly outputs from quality-filtered short metagenomic reads in the form of BAM files.
$ ls -R 04_MAPPING/
04_MAPPING/:
Culex_O03 Culex_O07 Culex_O11 Culex_O12
04_MAPPING/Culex_O03:
Culex_O03.bam Culex_O03.bam.bai Culex_O07.bam Culex_O07.bam.bai
Culex_O11.bam Culex_O11.bam.bai Culex_O12.bam Culex_O12.bam.bai
04_MAPPING/Culex_O07:
Culex_O03.bam Culex_O03.bam.bai Culex_O07.bam Culex_O07.bam.bai
Culex_O11.bam Culex_O11.bam.bai Culex_O12.bam Culex_O12.bam.bai
04_MAPPING/Culex_O11:
Culex_O03.bam Culex_O03.bam.bai Culex_O07.bam Culex_O07.bam.bai
Culex_O11.bam Culex_O11.bam.bai Culex_O12.bam Culex_O12.bam.bai
04_MAPPING/Culex_O12:
Culex_O03.bam Culex_O03.bam.bai Culex_O07.bam Culex_O07.bam.bai
Culex_O11.bam Culex_O11.bam.bai Culex_O12.bam Culex_O12.bam.bai
05_ANVIO_PROFILE
: Anvi’o single profiles for each sample.
$ ls 05_ANVIO_PROFILE
Culex_O03 Culex_O07 Culex_O11 Culex_O12
$ ls 05_ANVIO_PROFILE/Culex_O03/
Culex_O03 Culex_O07 Culex_O11 Culex_O12
$ ls 05_ANVIO_PROFILE/Culex_O03/Culex_O03/
AUXILIARY-DATA.db PROFILE.db RUNLOG.txt
$ find 05_ANVIO_PROFILE -name PROFILE.db
05_ANVIO_PROFILE/Culex_O11/Culex_O07/PROFILE.db
05_ANVIO_PROFILE/Culex_O11/Culex_O03/PROFILE.db
05_ANVIO_PROFILE/Culex_O11/Culex_O12/PROFILE.db
05_ANVIO_PROFILE/Culex_O11/Culex_O11/PROFILE.db
05_ANVIO_PROFILE/Culex_O12/Culex_O12/PROFILE.db
05_ANVIO_PROFILE/Culex_O12/Culex_O11/PROFILE.db
05_ANVIO_PROFILE/Culex_O12/Culex_O07/PROFILE.db
05_ANVIO_PROFILE/Culex_O12/Culex_O03/PROFILE.db
05_ANVIO_PROFILE/Culex_O03/Culex_O11/PROFILE.db
05_ANVIO_PROFILE/Culex_O03/Culex_O12/PROFILE.db
05_ANVIO_PROFILE/Culex_O03/Culex_O03/PROFILE.db
05_ANVIO_PROFILE/Culex_O03/Culex_O07/PROFILE.db
05_ANVIO_PROFILE/Culex_O07/Culex_O03/PROFILE.db
05_ANVIO_PROFILE/Culex_O07/Culex_O07/PROFILE.db
05_ANVIO_PROFILE/Culex_O07/Culex_O11/PROFILE.db
05_ANVIO_PROFILE/Culex_O07/Culex_O12/PROFILE.db
06_MERGED
: Anvi’o merged profile databaes for each sample.
$ ls -R 06_MERGED
06_MERGED:
Culex_O03 Culex_O07 Culex_O11 Culex_O12
06_MERGED/Culex_O03:
AUXILIARY-DATA.db PROFILE.db RUNLOG.txt
06_MERGED/Culex_O07:
AUXILIARY-DATA.db PROFILE.db RUNLOG.txt
06_MERGED/Culex_O11:
AUXILIARY-DATA.db PROFILE.db RUNLOG.txt
06_MERGED/Culex_O12:
AUXILIARY-DATA.db PROFILE.db RUNLOG.txt
DATA doi:10.6084/m9.figshare.6263867 serves the raw assembly output for contigs longer than 1 kbp.
DATA doi:10.6084/m9.figshare.6263876 serves the anvi’o merged profiles for each mosquito.
Reconstructing Wolbachia population genomes
To reconstruct population genomes we first automatically divided all contigs into five bins. Our expectation was to find bacterial contigs of the endosymbiont to only occur in one of the resulting bins, so we could manually refine the population genome minimizing artifacts from automatic characterization of tremendous amount of contigs. This is a strategy our group used before to curate high-quality genome bins from complex environments.
We run the following command in our work directory to run CONCOCT to generate five bins in our merged profile databases:
for s in Culex_O03 Culex_O07 Culex_O11 Culex_O12
do
anvi-cluster-with-concoct -p 06_MERGED/$s/PROFILE.db \
-c 03_CONTIGS/$s-contigs.db \
-C CONCOCT_5 \
--num-clusters 5
done
anvi-cluster-with-concoct has been superseded with anvi-cluster-contigs
Conclusion of this step added a collection, CONCOCT_5
, to each of our merged anvi’o profile databases with five bins in them. We then surveyed each mosquito with the following command to find out which bin likely contained the population genome (this example shows the output for O03
):
$ anvi-script-get-collection-info -p 06_MERGED/Culex_O03/PROFILE.db \
-C CONCOCT_5 \
-c 03_CONTIGS/Culex_O03-contigs.db
* Completion and redundancy estimates. PC: Percent completion; PR: Percent
redundancy; N: Number of splits; L: Length (total number of nucleotides); D:
Domain for single-copy core genes; C: Domain confidence.
Bins in collection "CONCOCT_5"
===============================================
Bin_3 :: PC: 94.24%, PR: 0.72%, N: 7,883, L: 25,685,408, D: bacteria (C: 0.94)
Bin_2 :: PC: 41.36%, PR: 20.37%, N: 8,398, L: 35,763,497, D: archaea (C: 0.21)
Bin_1 :: PC: 5.04%, PR: 0.00%, N: 24,122, L: 91,074,451, D: bacteria (C: 0.05)
Bin_4 :: PC: 9.35%, PR: 2.16%, N: 4,243, L: 14,285,067, D: bacteria (C: 0.07)
Bin_5 :: PC: 0.00%, PR: 0.00%, N: 1,806, L: 5,889,258, D: bacteria (C: 0.00)
For the case of O03
, it was clear based on percent completion (PC
) and percent redundancy (PR
) estimates that Bin_3
contained the population genome (bin names differed for other individuals). We then used anvi-refine
to manually identify the poplation genome and curate it. Running the following command,
anvi-refine -p 06_MERGED/Culex_O03/PROFILE.db \
-c 03_CONTIGS/Culex_O03-contigs.db \
-C CONCOCT_5 \
-b Bin_3
gave us an anvi’o interactive interface, from which we could select the bin, and additional contigs after manual inspection and BLAST searches:
For instance, the additional contig in the figure above is an The additional piece for instance is a contig carrying an ankyrin repeat domain, a known component of Wolbachia, that only occurs in two of the insects in our dataset, hence would have been missed by automatic binning:
Once we were satisfied with our population genomes, we stored them in a bin called Wolbachia
. We then did the following to have a new collection in these profile databases called FINAL
which described a single bin called Wolbachia
:
# export the CONCOCT_5 collection with Wolbachia bin
anvi-export-collection -p 06_MERGED_2500nts/Culex_O03/PROFILE.db \
-C CONCOCT_5
# create a file with only Wolbachia splits:
grep Wolbachia collection-CONCOCT_5.txt > FINAL.txt
# import FINAL into the profile database:
anvi-import-collection FINAL.txt \
-p 06_MERGED/Culex_O03/PROFILE.db \
-c 03_CONTIGS/Culex_O03-contigs.db \
-C FINAL
# remove temporary files
rm FINAL.txt collection-CONCOCT_5*
This way, every merged profile database contained a bin called Wolbachia
in the collection FINAL
for downstream analyses. Here is how it looked for O03
:
anvi-script-get-collection-info -p 06_MERGED/Culex_O03/PROFILE.db \
-c 03_CONTIGS/Culex_O03-contigs.db \
-C FINAL
* Completion and redundancy estimates. PC: Percent completion; PR: Percent
redundancy; N: Number of splits; L: Length (total number of nucleotides); D:
Domain for single-copy core genes; C: Domain confidence.
Bins in collection "FINAL"
===============================================
Wolbachia :: PC: 94.24%, PR: 0.72%, N: 101, L: 1,218,385, D: bacteria (C: 0.94)
Once we stored a Wolbachia
bin in every anvi’o merged profile database, we summarized these collections so we later have access to gene coverages, gene functions:
for s in Culex_O03 Culex_O07 Culex_O11 Culex_O12
do
anvi-summarize -p 06_MERGED/$s/PROFILE.db \
-c 03_CONTIGS/$s-contigs.db \
-o 06_MERGED/$s/SUMMARY \
-C FINAL \
--init-gene-coverages
done
DATA doi:10.6084/m9.figshare.6292040 serves FASTA files for Wolbachia population genomes.
DATA doi:10.6084/m9.figshare.6263876 serves the anvi’o merged profiles for each mosquito that includes a collection FINAL
with a Wolbachia
bin.
A putative Wolbachia plasmid: pWCP
Our manual investigations of the four Wolbachia population genomes, coverages of their genes and contigs across metageomes, as well as closer inspections of gene functions and BLAST searches revealed a putative plasmid, which we tentatively named pWCP. Our study describes pWCP in more detail.
The following sequence is an artifially circularized version of pWCP from a single mosquito O11:
>pWCP
AATATTGTAAGTCAAAAATATGATATAAAGGAAATAAGAGGGAAAGATAACCCTACTCACTATAGGAATAGGGTTATCAC
GGGCGTGTGCGACCGGCAGAATATTGGTATTACAACCGTGTCCATCAAGGTACACTAGCCCATCTCTCGATACGTTTGAA
TAGTTGTGTGGGACATGTATATGCACCCGTTTACAATCTTAAATTAATTAAACTATCGAGGTTATAATGATTACATCTTA
TCAAAATTTTATTGGCATTGATATCGGGAAATTTAAAAATGTTGTTGCAGTTCACGAACAGAAGAATACTGTAGGATTTG
ACAATAATACTTCTGGCTGGCAACAATTGTTTCAAAAGTTTTCAGATATCCTACCTAATTCTTTAGTAACTTTAGAAAAT
ACAGGAAAGTATGAGCTTGGTTTATTACATTTTCTTGTTGACAAAAATATTGCCACACATCGAGCTAATACCCGTAAAGT
AAAAAGCTTTATCTTGTCTCACGGAACTTTAGCAAAGTCTGATAAATCAGATGCAAGAGCTCTTGCTCAATATGGATTTG
AACGCCACAGCACTATCTCTCTATTTGTACCCACTTTTGAAAATCAATCAACCTTGGTTGCACTTTGTCAACGTCGTGAT
GACATTACGCAAATGAGAGCTCAGGAAAAATGTAGACTGGCAGCACCTGAAAATGACCATATAAAAGAAAGTTGTCAAAA
AACTATCGAATTCTTTAATATTCAAATAGATGAACTCAATAATACTATACAAAAAATTATTGATGAAAGCCGTGAGTTAC
AACAACGTCAAAAAATCCTAAAAACAGTTCCTGGAATAGGTCCAAAGTTATCTCAAGATTTTCTCTGTTTAATGCCAGAG
CTTGGTTACTTAAGCAGGAGAGAAGTTGCAAGCCTTGCTGGGGTTGCACCTTATCCAAAAGAAAGTGGTAAAACTATTGG
ATACCGAAGAATAACAGGTGGTAGAAGCAATGTTCGTGCAAAACTTTTTACATCTGCCATGGCTGCTGCAAAGTCCAAAT
CTGTACTTGGTGCCTTTTATTCTAAGCTTGTTGAAAGTGGTAAGAAGAAGATGGTAGCTATAACAGCTCTAATGCGTAAA
ATTATAGTAATTGCTAACGCAAGGCTTAAAGAAGCAATTAATTTGAATATTTAAAATTTACATAGCAGTAATTGAAATGA
ATATTGTGAATAAGCACGTCCGCACAAACTTAAAGTGCTTATTCACAATAGCTTAAACGGAAATTGTTGCAGCAGCTGTT
TGATATAGAATTCTGTTTCTATGACAAACGGGTTTTTATTATCTATATGGTTGGGTAAAGTTGCAAAATTATACAAATAA
AAAATTTAAAAAACATAGTTGATGGAGCTTCTCTACCAAGGATTAAATGAATTAGCCTTTTTAAATCTCCTTGTCGATAG
TTCAGGAAGTATGGATTTAAATGTGCTGAGAAGAAAGATTGCATCAATCTGTGAAAAACACAGCATCAAATGCATCTACA
TTGACTATCTGCAGCTACTCAGAGGATCAGGAAGAACAGAGAACAGAACGCTTGAAGTGACAGAAATCAGCAGAACATTA
AAGAACATCGCTAAAGACTTTGCAGTTCCTGTTGTCGCGCTATCACAAATTAGTCGCAGAGTGGAAGAAAGGCAAAACAA
AAGACCGCAACTAGCAGACCTGCGAGAGTCAGGGAGCATCGAACAAGACGCTGATATAGTTTTACTACTCTACAGAGACA
GCTACTATAAGATAGGCTCTAATAACGAACTAGAGATAAATGTCGCAAAGAATAGAAGTGGCTCAACCGGTAAAGTTACA
GTGCGCTATCAAGTTAATACTGGAAGAATATTGATTTAACGTTTTAGTAAGTCTAATATTGCCTGATTGTAAATGTTATC
TCTATGTCCTATTGAGACAACAGTTACTATACGTTTTGCAGTATCTACACGGTAGACGATACGATAATCACCCACTCGTA
TTCTGCGATATCCCTTTAATCTATTGCGTAATGATACACCGCTAGCAATAGGATCAGTCTCAAGATATTCTTTTATTGCT
TCTTTAATACTTGGTTTAATGTCTTCTGGTAAGGAAGGTATATTCCTCTTAATAACATGTTTGAGATATTTGATAGTATA
ACGTTTATTTCCAGATGTCTTCACTATCTTCTATTTCTTCCGCATCATCATCACTTAATTCTCTAATAATTTTAGAAAAC
GCAATGTCTTCTGCTTCAAGCTCAATTGCTTCTTTGACTAACTTTCCTGCTAATTTTTGAACAGACTGCTTGGTAGCTTT
AGCTAGTTCAATAAGGTGCTGTGAAGTTTCTGTATCGAAGGTTACATTAATCCTTGAATCTGCCATAAGTTTTACCAAAG
TTTATCAGTAATTATATAATATTTTCCTGAATTTTTCAATAAAGATCTCTCCTAAACTTAGAAGAGATCTTTATGGTTAA
ACTACTTTGATTTGCTCGAAGTCTGTAAAGCAAAGTATGTGCTTGAATTCACGAACGAAGAAGGTAATGAACCCCCTAAC
TGTTCTAGCTCTCTCAGAGAAAGCTTCCCTTCCATAACACACCTTGCGTTTTTCTCTTCTTGTGTCTCAGAAATTTTTTG
ATTATACTTTGAACACGAGGCACTCTCATCATTTCTTTTAAAATCTTTCATAATTAATCCCTTAAAGCTATAAAAAATAC
TAAACTAGTATACAGTAAAACATAATACTTTAGCAAGTTACTATCCCAGTTGATTTAGTGTTTTCTCAGATATTAAACCT
TTTTTTAGTAATAATGTATTTTTCACTTCTCTTAAATCTGTAGAAGGACCTCCAAGTTGATCCACAACTTCTGAAATATC
CCATTTAGTCGGATCTGGCATATGCGGTAAAAGCCTTCCCCATGAACCTCCTAATTGCTCTAACTCTTTTTCAGGATATT
TTTCCTGCTGCACCTGTGAGCCCAATTCTTTAAGTGCAAACCTTGCCTCTGGACTTTTTGCTATAAAATCCTCTACTTTA
TTAGCTCCTTCAGATAAAAGCTTTACCTCCTCAGAACTTTCTTGTTGCAATTGTGACGCTAACGCTTCATACAAAGGCTT
TGCTTTACCTTGAGGACCTAATATTACCTCTGTATTTGCTTGCAACTGTTCTTCAGTAATACCATTTCTTTCAATAAACT
CGACAGGCTTGTCAGAATGAATAAGAACTTTGATCTTACATTCTTTTCCTTCAACTTCCCAATTAATAGTCATTTCATAT
GAATCATGAGCTACCTCATATATTCTTGTTCCATCCTCACAAAGGGACAAACGTATTCCTCTCGTTTTATTATCGTCAGG
AGAGCAAAGAGTAATTGTTGTAATACCTAGTTCATTGCATGATTTACCACTCAAAAAGTCTAGCAAGTTGATTTCTTTTT
TACCTTCAAGCCACTCAAGCCTTATGGAAGGAAAAGCAACTCCATCTTCATCTACATGTTTTTCAAGACATAAAAATTTT
AGTTTTAAACCGCTATCTTTTATTTTTCTTAATACTTCTTTTTTCGTTTCATCTGTAACAAATCCTGTAATTATAGGTCG
TTCACGATAATTCATATCATACAATTTGCCATTGCCCAGCATATCAGCATGAAAATAATCAGAAGAAAATTTCTCCCACG
AATCCTTATCTGAAAAGTGATGGCCATGTGTATTACCAGGAATAAAATCACTTACAGAATTTCTACTAAAACCCTTATCA
AGAAAAAGTGGAACTAAAGTTTCCCCAAACTTCTTTTTACTTGACCAAATCGATTTTCTTAAAGCATTTAACATACTAAC
CCCCATATTAATGCCATAATAAATTTATTATGACATTACGGAATATGCTAAGTCAAGTTTATAACTTTAAATTTTTATGA
AAATTAACAAAAAAATCATACTTTACATGGCATGAAAAACGCCGTATTATACCACCTCTAGTAGAAATCAAGCTTAAAAT
TTAAGGGAATTTGCGTAAAGGGCATTTCTGCCCTTTACACGCATTTAAATTTTTAGTAGATTCGAGCTGCAAACAAAAAA
TCTTACTAAGACACATGAGGCACAAAAACAACAGAACGTATCGCACCAGAACGTATCAAAAAAAATTGTACAGAAACGAA
AAAAAGAGTCAAGCAAAAAAAAGCTTGACTTCCCATTATCGCGTCTATCACAACTTCAGCATAAGAGGAAGCCTTCATAT
AGTTAACGTAAACGGTAAATATGTGCAGGCGCAAATAGGCTCTAGAAGGCGAATAAAGGCGCGAGAAAAGTCAATTGAGC
TTAAGCGTCCGATAAGAACTTTCGACACGCGTCTTACTGGCAGTGAACTACTACAAGTTTTAAAAAACAAGGAAGTAAGA
TGGTTTAAAAACGGCACTGCTGTTCCTCCGGAACATAGTACTTATCTTTTAGAGGCTTTTAATCCACACAAAGCAAAGAA
AATCTTTTCCAGTTCAGCGCAAAAAGCGGCGGAAAAATACTTTGTCAGAGTAGTAAAAGAAAAATTTCGCGATTTAAAAA
CAGCGAAAGCACCGGATGTGCTCGCAATTAGAGAGAAAGCAGTAGAAAAAGTTCTCGAAAAGTTCAAGAATCACTGTGAA
TTTATATCAATAGAGGTATACATTAAGAACTTATGGATAACTTTCGACGAACATCTCTATAGCAAGATGAGCATGAAAGA
AAAAGTGGACCACCACATCGATTTGATCATACAAAACACTAGTCAGGAGAGAGAGGAAGAACTGCTTGCTGCAGGAGAAA
CGCCGGAGGAAGTATACCACACTTTCAGAACGGTCCAGGGATCGAATCTGTATCGCATCAGGCTGCTGAGCGAGCGAGGA
GGAGCTGCGCTTGAGATATACATCAGGAAAGAGCTGCAGCGGCGGCTAAAACTCCTTTTCAACCACTTTAAAGATAGAAT
GATCAGAATTGGTGTCGACACGGAGAAAGACTTTTTCTCGATTGCCCGCCGGCATTTTTCGTTTTTATTCACGAAGAAAT
ACGTCAAAAACAGAAAGAAAGTGATGAGGGAGGTACGTCATATCCTTGCTCTCATCAGAGAAAACCCGTTTGATGTTCTG
AAGGATTACGACAATTTATGCAAATTTGATCCACCACCTCTCTAGAGAAAGCAAAAATGTTACCCGAAAGTGACCCCCAT
CATGCTTAATTGAAAAAAACAATAAAAAGTGCGACGATTATATTAGTTTTAGTATTAGGAAATTGCTCTTACTTAAAAAA
TAATAAGGTAAAATCCTTTCATTTTTCTTTCTATTTAAAAAAGTTATACACAATACAATATATAAAAGTATATATTATAG
TTTTTATAGTATATTCTTATTATAAAAGTCGAAAGTACCCCTATATTATCACTATAAAGGAAGAAAGAGTTATAGAGAGG
TTATTCACAGTTATTTCTTCAAAGTTTTAATGTATTCAGTTAAAGCCCTGAATCCAATGTTAGAGACCGTATCTTGAGTT
TTAAAGGCTATTTCTTTCACTTCTCTACAAAGATCAGGTGAAAGCCTTAATGTTAATCGTTGTTTTTGTTCCCGTTCTTT
AATAAAAGCATGTCTCTGTTGATCCTCTTTTTTATAACCCACATCAGAGATTCTATGTAATCTATGTTCCTTCGATCTAT
GTTCTGAAATTACAGGCATTATTATTTTCTTAGTGTCCACTAAAAACCTCCTTGTAAATAGATTTAATTTCCTCAATTGC
CTTTGAATCGGCTGATTCAACAATCGTCAAACCTTCAAGAGAGGAGTCTTCAACTACAGCCCGCTGACTCAAAAAACTAT
CGAGCCTTGTTAAATTATCAAAATGACTTAGAAAACTATCACATTTTGTGATTATTTTTTTAGCTCTAGTTGGATTAGTA
TTAACTCGATTAAGTAGGATTCTTGCTTTAAAGATTTTATTACAACTCCGAGCACCGGCAATTAAATTGTTAAGAACTTT
GAATGTCCACACATCAATTCCAGATGGAACAATCGGAAAGATCACTATATCAGCTAGTAATAAAGCAGCTCTTAATGCTT
CATTATTACCACCACCTACATCAACAACAATGTCTTGATATTCAGAATGCAAAGCTGTTAGTTCATTTTTGATTGCTATG
CCATCTCGTGGCCCTTGACTACTGCTAATCACCGGTAAGTTTATGTCCTGATCTCGTTGAGATGCCCACAAGCTGGCCAT
CTCTTGATGATCAATATCATAGAGGTGAACGTTTCTACCTTCTAGAGTACGCATTAATGCTATGTTTGTGGCTATGGTTG
TTTTACCACAACCACCTTTCTGACCACCGACTAATATTATCATGCCTACCTTAGTTATAAAACAACATCTTAGTGACACG
TCACCATGTCGTCAAGTCAACGTGTCACCATGTCAACGTGTCACCATGTCACCATGTCACCATGTCACCATGTCGCAAAC
AAGGCTTATTCACTGAATAATCAACAATTCTGTTAATAACTACAAACACTCAAGCAAGAAAAAGGAGCTTTTGCTCGCAC
AGAAACATCGTTTTTAGCGTCGTTCTTCAAAATCGTATAATTATACACGTTATCGGGGCAAGGTATCCCTAGACGGTTTC
TTTCAGGAAACCACATGATGAACTCACTGAGGACATCGGCGATATCATCGGTGAACTCAGTTTCAGTATGAATCTTTTCA
AACTCATCAAATATCAAATAATGAAAATCTCTTTCCGCTTTCTCAAGAAAGATATGACCTTGCTCAACAAACTGACTGCA
AGACTGAAACACTTCCATCTTTGACCTCGTTCTTCTTGAGCCTAAAATCAGTGTGTCTGGCAATAGCATTCGAAGATCTG
CGATCACCTTAAAGCCAATACCGTGTTCTTCCACGACTACTGCGTGCAGCCTATGCTCATATTTACTGAGAAACTCATGA
ATATCGTGTTCAACTGAACTCGGGTCAAGCTTTCTTGCCAATAAATCCACCCATACTCCACATCTTTTTGGCTGATGGGC
AAAATCACTCTTCGTAAAGAACGGCTGAAACACTCCTACCGCTGTTCTATCAAGTGTTTTCAGCGAACCACTGTTGAGAT
CGATAAATGCTACCAGATCTTTTTCTCCTATTTTCTCATAAGGAAAATCAATATTCCTAATAATGTCTTGCAATACTATA
CTTAGTATTAACTTCTGTGTCATTTCTCCCGAGCATCAGCAAAAGCCAATGACTCTGTATACTCAACCTACCATAATCTG
TCAACATAATTCTTTATTGAAAATTTATTAACACAGGTTGAGTATTTTTCAATTATATCTATAATAAAAAGATGGATATG
AAATACCGTTATTGGGACTCTCCAGAAACAGAAGAGATTGTTCGCAAAACTTACGATGCTCAATTAAAGATTATAGATGA
TATAAAAAATGATAAAGAACTTGGGAAATTAGCATTAGAAATATTATATACCGTTACTTACTGTAATTTACGGGATCTTC
CCTTATTAACTTTTTATAGTTCAGCTACCACACATCATAATAATCTTTATGGTGGTGTGGATCTTCATACACTTTATGAA
CTTTATGATGAACCTTTTGAAGATGCTGAAGATTTACTAATGGCTTCTTTTGATTTATTAGTAAAATATCAATTGGTTGA
TTGTTATAAAAATGGAGTGATACAAGTGTTATCCATGCATCAAGTCACTCGAAGCAATATACAGCTTATGTTAAAAGAAC
AAGGTATAGAAGAAGAAGTTTTAAAAAAAGCCATAAGTGTATTATCGAGATTTCTGAGTGAGCTCAAATATTCATTAGAT
GACAAGGATGAAGGAGCGGGTTTTTTTTCTTTCATAGAAGGAATAATTCCGTATGCGTGCTCTAATGCACAACACGTTTT
ATATCATGCAGAGGAATACAAAGAATTAAATAAGTATTGCGAAAAAGAAAAATTAGAACTTTTAATTGATTATTTCGTAG
CAGCAAGACTTTATCAAAAACATAGTATTACTATACATCACTAAATATTAAATATATTTTATACAGCCCTCCTTCCTCTT
CACCTTAACAATTCTTTATTGAAAATTTCATGAAAATGTTGTATAATTGTAGAGTAGTAGAACTCGGTAAAACTCATGGC
AAACGCTAATTTAGCTTTCAGTAAAGAAACTTTACAGCATCTTGCTGAATTATCTGAACTTACTAAGCAACCTGCTCAAG
CATTAGCAGAAAAACTACTTAAAGAAGCAATCGAGCTTGAAATCGAAGATTTTTTAGTTTCTAAAATTTCTGATGAGCGT
GATGTCGAAGGTGCAGAAATGATTAAAAGTGAGGATGTAGATTGGGATACACTATTATCTTCGTAGGAAAGGTTATTAGA
AAAGACCTTCCTGACCTTCCAAAAACAATAAGATTGAGGGTCGTAAATGCGATAAATGAAAGACTTACAGTTGATCCAAT
GAACCTCGGTGAACCATTGCACCACAGCTTAAAAGGACGTAGAAGGTTAAGAGTTGGTGAGTACCGTGTGATTTACAGAG
TAAATCAATTAGAACAAATAGTGACCATCACAGAGATAGGACATAGATGTGATATTTATAAAAAATAAATATAAACATTA
AATATACATTATACAGCCCTTCTTCCTGATCCATTCTTCACCTTAGGCGGCTTTGCTCAGCTCAAAAATAACTATTTTTT
TGGGTACCCCCCACTATAAGTTATATTATATTATATTATTTATATATAATATAATATAACTCATTGGGAGAGGCGACGAA
AACAAGATCTTGAGTTTGAGGAGCCTCCGGTTTTAGGTGAAGAATGGATCAGGAAGAAGGGCTCAGGGTCGATGAGGACG
TAAGGCCAATATTTGATTATCTCAAAGACTACAGAGACTACATCGATAAACGGCTAGCAAATCCCGACGAAATAGAAGGA
ATAACCACAGGAATAAAGGAGCTTGACACAATAACAGGTGGCTTCAAGGAAGGAGAGCTCTCTGTGCTTGCAGCACGCAC
CTCAATTGGCAAGACTTCAATAGCACTACACATGGCTCTAGAGGCCGCAAAGCTCTTAAATGACCACGAGTATGTCTGCT
TCTTCTCCCTTGAAATGTCAGCCAATCAACTCATCAACAGAATAATCAGCATCAAAGTCAATGAGACAGTCAACACTATC
ATTAAAAACAAG
We stored this sequence in our work directory as pWCP.fa
for convenience.
DATA doi:10.6084/m9.figshare.6380015 gives access to four FASTA files that contain sequences for the plasmid sequence recovered from each individual mosquito metagenomes in our study.
Investigating the circular and extrachromosomal nature of pWCP with long reads
While the assembly and metagenomic read recruitment results suggested that pWCP was likely an extrachromosomal circular element, these observations could not rule out an alternative hypothesis that was brough up by one of our reviewers: the potential occurrence of pWCP in the Wolbachia genome as tandem copies.
To investigate the extrachromosomal nature of pWCP independent of strategies that rely on PCR and short-read assembly, we embarked on a journey to generate long reads from our mosquito samples using a minION sequencer.
We hypothesized that if pWCP was indeed an extrachromosomal element, it should never occur in long reads where pWCP is flanked by recgions that match to the Wolbacha chromosome. This hypothesis could be tested rather confidently with minION reads since the sequencing strategy behind this technology relies on the interpretation of blockade signals as single molecules go through nanopores.
Please see our study for molecular protocols for sample preparation and seqeuncing.
Once the minION run was completed, we moved the output directory with raw sequences into our work directory, and called bases the following way using Albacore, Oxford Nanopore’s bascalling software suite (please note that we only focused on the reads that were reported in the pass
directory):
# process the minION raw output for passed reads
read_fast5_basecaller.py --input fast5/pass \
--recursive \
--barcoding \
--worker_threads 7 \
--save_path Culex_pipiens_minION
DATA doi:10.6084/m9.figshare.7306784 gives access to the de-multiplexed minION raw reads.
Then we created a single FASTQ file for all reads associated with all barcodes (we had multiplexed multiple samples that were coming from different mosquitoes):
cat Culex_pipiens_minION/workspace/pass/*/*.fastq > minion_passed.fastq
We then converted the resulting FASTQ file into a FASTA file, and retained only reads that were longer than 5,000 nucleoties:
iu-fastq-to-fasta minion_passed.fastq \
-o minion_passed.fa
anvi-script-reformat-fasta minion_passed.fa \
--min-len 5000 \
-o minion_passed_5K.fa
Input ........................................: minion_passed.fa
Output .......................................: minion_passed_5K.fa
Minimum length ...............................: 5,000
Total num reads.. ............................: 76,450
Total num nucleotides ........................: 228,869,370
Reads removed ................................: 61,642 (80.63% of all)
Nucleotides removed ..........................: 76515830 (33.43% of all)
Deflines simplified ..........................: False
This gave us a total of 14,808 high-quality reads (whatever that means in the context of minION reads) that were 5,000 nts or longer.
To search for pWCP in this collection of reads, we created a BLAST database from the minION passed reads:
makeblastdb -in minion_passed_5K.fa -dbtype nucl
And finally BLAST searched one of the pWCP sequences (in fact the one against the long-read database:
blastn -query pWCP.fa \
-db minion_passed_5K.fa \
-evalue 1e-20 \
-outfmt 5 \
-out pWCP_vs_minION_reads.xml
The pWCP_vs_minION_reads.xml
, a copy of which is here, contained BLAST hits for each match to any part of the pWCP among minION long reads. We then visualized this output using Kablammo. The summary figure that shows how the long reads align to the pWCP artifically circularized genome suggested that pWCP, which carried the functinal hallmarks of a plasmid, was an extrachromosomal and circular element:
The resulting hits also contained multiple matches between the IS element (which was the first gene in our artificailly circularized sequence for pWCP) and long reads that match to the Wolbachia genome such as this one:
Read recruitment experiments with pWCP
We searched pWCP across our and other publicly available metagenomes to investigate its occurrence and manifestations.
In addition to our metagenomes, we downloaded the following publicly available metagenomes:
Accession ID | Description |
---|---|
SRR5810516 | wPipIV_Istanbul Wolbachia strain endosymbiont of Culex pipiens |
SRR5810517 | wPipIV_Harash Wolbachia strain endosymbiont of Culex pipiens |
SRR5810518 | wPipI_Tunis Wolbachia strain endosymbiont of Culex pipiens |
SRR5179421 | Wolbachia endosymbiont of Laodelphax striatella |
SRR018517 | Drosophila melanogaster DGRP-313 Mate-Pair |
SRR1238105 | Environmental/Metagenome sample from Anopheles gambiae |
SRR1238106 | Environmental/Metagenome sample from Anopheles gambiae |
The plasmid did not recruit even a single read from anything but C. pipiens metagenomes, hence, we only included the first three metagenomes for downstream analyses for brevity.
We first acquired the data:
for ACC in SRR5810516 SRR5810517 SRR5810518
do
# download the interleaved FASTQ file from the NCBI
fastq-dump $ACC
# de-interleave the file:
./deinterleave_fastq.sh < $ACC.fastq "$ACC"_R1.fastq "$ACC"_R2.fastq
# remove the interleaved file
rm $ACC.fastq
done
We later realized that the Istanbul, Harash, and Tunis samples were extremely efficiently sequenced, hence there were tremendous amount of Wolbachia reads in these files, reaching to the limits of samtools.
After inspecting the coverage of single-copy core genes of the Wolbachia population genome of O11
in these samples, we subsampled these reads by manually ajusting the number reads to be kept from each metagenome so the coverage of Wolbachia between all metagenomes are comparable:
for RX in R1 R2
do
head -n $((4455562 * 8)) SRR5810516_"$RX".fastq > SRR5810516_SUBSAMPLED_"$RX".fastq
head -n $((5164154 * 6)) SRR5810518_"$RX".fastq > SRR5810518_SUBSAMPLED_"$RX".fastq
head -n $((4296013 * 8)) SRR5810517_"$RX".fastq > SRR5810517_SUBSAMPLED_"$RX".fastq
done
We then prepared the following samples.txt:
sample r1 r2
Pipeins_MGx_Istanbul SRR5810516_SUBSAMPLED_R1.fastq SRR5810516_SUBSAMPLED_R2.fastq
Pipeins_MGx_Tunis SRR5810518_SUBSAMPLED_R1.fastq SRR5810518_SUBSAMPLED_R2.fastq
Pipiens_MGx_Harash SRR5810517_SUBSAMPLED_R1.fastq SRR5810517_SUBSAMPLED_R2.fastq
Pipiens_MGx_O03 01_QC/Culex_O03-QUALITY_PASSED_R1.fastq 01_QC/Culex_O03-QUALITY_PASSED_R2.fastq
Pipiens_MGx_O07 01_QC/Culex_O07-QUALITY_PASSED_R1.fastq 01_QC/Culex_O07-QUALITY_PASSED_R2.fastq
Pipiens_MGx_O11 01_QC/Culex_O11-QUALITY_PASSED_R1.fastq 01_QC/Culex_O11-QUALITY_PASSED_R2.fastq
Pipiens_MGx_O12 01_QC/Culex_O12-QUALITY_PASSED_R1.fastq 01_QC/Culex_O12-QUALITY_PASSED_R2.fastq
reference path
pWCP pWCP.fa
and the config.json:
{
"samples_txt": "samples-for-plasmid-recruitment.txt",
"fasta_txt": "fasta-txt-for-plasmid-recruitment.txt",
"references_mode": true,
"anvi_gen_contigs_database": {
"--project-name": "{group}",
"threads": 5
},
"centrifuge": {
"run": false
},
"anvi_script_reformat_fasta": {
"run": false
},
"iu_filter_quality_minoche": {
"run": false
},
"anvi_profile": {
"threads": 5,
"--sample-name": "{sample}",
"--skip-hierarchical-clustering": true,
"--min-contig-length": "0"
},
"anvi_merge": {
"--sample-name": "{group}",
"--skip-concoct-binning": true
},
"import_percent_of_reads_mapped": {
"run": false,
"threads": ""
},
"references_mode": true,
"output_dirs": {
"CONTIGS_DIR": "03_CONTIGS"
}
}
We finally run the the anvi’o metagenomics workflow for reference-based read recruitment, which looked like this:
anvi-run-workflow -w metagenomics \
-c config.json \
--additional-params \
--cluster 'clusterize -log {log} -n {threads}' \
--resources nodes=40 \
--jobs 10 \
--rerun-incomplete
When this process was complete, we run anvi’o on the merged profile database:
anvi-interactive -p 06_MERGED/pWCP/PROFILE.db \
-c 03_CONTIGS/pWCP-contigs.db
And visualized the coverage of the plasmid across metagenomes:
Meren’s note for the curious and attentive: Where are those single-nucleotide variants are coming from? Why does O11 and O12 are lacking any? Why is there a slight bump in coverage in O03 and a slight drop in O07? If you can answer all these questions and send us an e-mail Meren will print your name, put a crown on top of it, display it on his office door for one month, and tell anyone who asks about its meaning how awesome you are.
DATA doi:10.6084/m9.figshare.6380615 serves the anvi’o merged profiles to reproduce the display above and access to the coverage of pWCP across metagenomes.
Computing the Wolbachia metapangenome
In order to put our four Wolbachia genomes in the context of the wPip reference Wolbachia genome from C. pipiens, we computed a pangeome.
Downloading wPip reference genome
For this, we first downloaded the wPip reference genome from the NCBI by visiting this address, and generated an anvi’o contigs database for it:
# simplify deflines in reference FASTA
anvi-script-reformat-fasta NC_010981.1.fa \
--simplify-names \
-o wPip.fa
# generate a contigs database
anvi-gen-contigs-database -f wPip.fa \
-o wPip.db
# run HMMs
anvi-run-hmms -c wPip.db
Computing the initial pangeome
We then generated anvi’o genomes storage by accessing our Wolbachia metagenome-assembled genomes through an ‘internal genomes.txt’ file:
name | bin_id | collection_id | profile_db_path | contigs_db_path |
---|---|---|---|---|
Wolbachia_O03 | Wolbachia | FINAL | 06_MERGED/Culex_O03/PROFILE.db | 03_CONTIGS/Culex_O03-contigs.db |
Wolbachia_O07 | Wolbachia | FINAL | 06_MERGED/Culex_O07/PROFILE.db | 03_CONTIGS/Culex_O07-contigs.db |
Wolbachia_O11 | Wolbachia | FINAL | 06_MERGED/Culex_O11/PROFILE.db | 03_CONTIGS/Culex_O11-contigs.db |
Wolbachia_O12 | Wolbachia | FINAL | 06_MERGED/Culex_O12/PROFILE.db | 03_CONTIGS/Culex_O12-contigs.db |
And the external genome, wPip
, through an ‘external-genomes.txt’ file:
name | contigs_db_path |
---|---|
wPip | wPip.db |
# creata an anvi'o genomes storage
anvi-gen-genomes-storage --internal internal-genomes.txt \
--external external-genomes.txt \
-o WOLBACHIA-GENOMES.db
# compute the pan genome
anvi-pan-genome -g WOLBACHIA-GENOMES.db \
-o Wolbachia-Pan
# add a default collection that includes all gene clusters, so there is
# a collection called 'DEFAULT' with a bin called 'EVERYTHING' that
# describes all gene clusters
anvi-script-add-default-collection -p Wolbachia-Pan/Wolbachia-Pan-PAN.db
# Summarize the DEFAULT collection
anvi-summarize -p Wolbachia-Pan/Wolbachia-Pan-PAN.db \
-g WOLBACHIA-GENOMES.db \
-C DEFAULT \
-o Wolbachia-Pan/SUMMARY
# unzip the summary output for downstream analyses
gzip -d Wolbachia-Pan/SUMMARY/Wolbachia-Pan_gene_clusters_summary.txt.gz
This process resulted in a basic pan genome that we could visualize the following way:
anvi-display-pan -p Wolbachia-Pan/Wolbachia-Pan-PAN.db \
-g WOLBACHIA-GENOMES.db
We set to improve this pangenome with additional data layers, including a data layer that marks gene clusters that carry genes that resolve to prophages identified in wPip reference genome, and layers that show the coverage of each gene cluster across metagenomes.
Following chapters describe the details of these steps.
Identifying genes that correspond to WO prophages
To identify gene clusters in the pangenome that correspond to prophages identified in wPip reference genome, we first identified gene caller ids in wPip contigs dataase that match to WO prophages.
We first exported all gene calls from the wPip reference genome:
anvi-get-dna-sequences-for-gene-calls -c wPip.db \
-o Wpip_genes.fa
We acquired the five WO prophage genomes from Klasson et al, 2018, named it ‘WO_Klasson_et_al_2008.fa’, and created a BLAST database for it:
makeblastdb -in WO_Klasson_et_al_2008.fa -dbtype nucl
We then identified wPip genes that match to WO prophages, and kept only the best hits:
# blast serach (note the extended outfmt that is necessary to compute
# the percent query alignment later)
blastn -query Wpip_genes.fa \
-db WO_Klasson_et_al_2008.fa \
-outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen' \
-out WO_in_wPip_hits.txt
# add percent alignment of the query as a new column
awk '{print $0 "\t" (($8 - $7) * 100.0) / $13}' WO_in_wPip_hits.txt > temp.txt
mv temp.txt WO_in_wPip_hits.txt
# generate an empty output file with the correct header:
echo -e "gene_callers_id\tWO_assignment\tpident\tlength\tmismatch\tgapopen\tqstart\tqend\tsstart\tsend\tevalue\tbitscore\tqlen\tslen\tpct_alignment" \
> WO_in_wPip_best_hit.txt
# sort the BLAST output to get the best hit for each gene call
# NOTE: the following command works on linux systems but fails on
# Mac systems
sort -k1,1 -k12,12gr -k11,11g -k3,3gr WO_in_wPip_hits.txt |\
sort -u -k1,1 --merge >> WO_in_wPip_best_hit.txt
We kept the resulting file, WO_in_wPip_best_hit.txt, to merge it later with the rest of the information with which we wished to decorate the final pangenome (namely the coverage stuff, but you already knew that since you are following very carefully).
Recovering coverage values for gene clusters of the Wolbachia pangenome in C. pipiens metagenomes
The pangenome reveals the relationship between genes across genomes by identifying gene clusters, so one can learn which gene is specific to one genome and which ones are shared across genomes. However, it deos not communicate the coverage of genes found in a gene cluster in corresponding metagneomes from which the genomes that carry them originate. Fortunately, linking pangenomes and metagenomes can offer such critical insights, and here we decided to extend our pangenome using a strategy to show average coverages of genes in gene clusters across metagenomes.
The following R script, gen-additional-gene-cluster-data.R, recovers the coverage values per gene of interest using the summary of anvi’o profile databases, combines this information with WO annotations, and creates a final file that could be imported into the anvi’o pan database for visalization:
#!/usr/bin/env Rscript
library("dplyr")
# read the gene clusters summary table which is the output of
# anvi-summarize command run prior:
gene_clusters_df <- read.table(file = 'Wolbachia-Pan/SUMMARY/Wolbachia-Pan_gene_clusters_summary.txt',
header = TRUE,
sep = "\t",
quote = "")
# remove singleton gene clusters unique to wPip since they have no coverage values
gene_clusters_df_of_wPip <- gene_clusters_df[gene_clusters_df$genome_name == "wPip", ]
# read the blast output for best WO hits among wPip genes
WO_in_wPip_df <- read.table(file = 'WO_in_wPip_best_hit.txt',
header = TRUE, sep = "\t", quote = "")
# learn which gene cluster ids corrpespond to wPip gene calls
gene_clusters_df_of_wPip_phages=left_join(gene_clusters_df_of_wPip,
WO_in_wPip_df,
by="gene_callers_id")
# keep only columns of interest
gene_clusters_df_of_wPip_phages_short <- gene_clusters_df_of_wPip_phages[, colnames(gene_clusters_df_of_wPip_phages) %in% c('WO_assignment', 'pct_alignment', 'gene_cluster_id')]
# now we need to generate coverage values for each gene cluster.
# O07 had the most number of gene clusters, presenting a meaningful
# target to have coverage values for as many gene clusters as possible.
# here, we first get the gene clusters for O07
gene_clusters_df_of_O07 <- gene_clusters_df[gene_clusters_df$genome_name == "Wolbachia_O07", ]
# then remove all columns but two
gene_clusters_df_of_O07 <- gene_clusters_df_of_O07[, colnames(gene_clusters_df_of_O07) %in% c('gene_cluster_id', 'gene_callers_id')]
# learn gene coverages for all four Wolbachia genomes in the O07 metagenome using
# the summary of the merged profile database for O07:
gene_coverages_df_of_O07 <- read.table(file='06_MERGED/Culex_O07/SUMMARY/bin_by_bin/Wolbachia/Wolbachia-gene_coverages.txt',
header=TRUE, sep="\t", quote="")
# anvi'o adds ribosomal RNA genes ad hoc after HMM runs, so they appear among gene calls but not in the
# pangenome which only uses ORFs identified through prodigal. we will remove them bofore continuing:
missing_genes <- setdiff(gene_coverages_df_of_O07$gene_callers_id,
gene_clusters_df_of_O07$gene_callers_id)
gene_coverages_df_of_O07 <- gene_coverages_df_of_O07[!(gene_coverages_df_of_O07$gene_callers_id %in% missing_genes), ]
# now we can merge the two;
gene_clusters_df_of_O07_with_covs=full_join(gene_clusters_df_of_O07,
gene_coverages_df_of_O07,
by="gene_callers_id")
# the following takes the mean coverage of genes per gene cluster
Mean_coverage_gene_clusters_df_of_O07 <- gene_clusters_df_of_O07_with_covs %>%
group_by(gene_cluster_id) %>%
summarize(mean(Wolbachia_O03),
mean(Wolbachia_O07),
mean(Wolbachia_O11),
mean(Wolbachia_O12))
# Join WO assignment and Coverage so that we get GC common to wPip and
# O07 with 4 additional layers
gene_clusters_df <- full_join(gene_clusters_df_of_wPip_phages_short,
Mean_coverage_gene_clusters_df_of_O07,
by="gene_cluster_id")
# give better names for each column
names(gene_clusters_df) <- c('gene_cluster_id',
'WO_assignment',
'pct_alignment',
'cov_in_003',
'cov_in_007',
'cov_in_011',
'cov_in_012')
# Here we have a data frame that lists for each gene cluster (1) the mean coverage values
# per gene per genome, (2) WO assignments if genes match to wPip prophages, and (3) the percent
# alignment of the query gene seqeunce so we can filter out weak hits. the following
# section of the code works on this data frame to produce a final additional item
# data file to be incorporated into the pan genome.
# a better max function that the default max function. just ignore:
better_max <- function(x) ifelse( !all(is.na(x)), max(x, na.rm=T), NA)
output_df <- data.frame(gene_cluster_id=character(),
WO_assignmnet=character(),
pct_alignment=numeric(),
cov_in_003=numeric(),
cov_in_007=numeric(),
cov_in_011=numeric(),
cov_in_012=numeric(),
stringsAsFactors=FALSE)
# here we go through every gene cluster, pick the most frequent WO assignment if there are any,
# and then discard the ones resulted from hits where less than 90% of the query sequence
# were aligned
N = 1
for (gene_cluster in levels(gene_clusters_df$gene_cluster_id)){
# gene_cluster <- "GC_00001082"
df <- gene_clusters_df[gene_clusters_df$gene_cluster_id == gene_cluster, ]
freqs <- as.data.frame(table(df$WO_assignment))
freqs <- freqs[freqs$Freq > 0, ]
most_frequent_assignment <- as.character(freqs[order(c(-freqs$Freq)), ]$Var1[1])
best_pct_alignment_score <- better_max(df$pct_alignment)
# remove weak WO assignments
if (!is.na(df$pct_alignment[1])){
if (df$pct_alignment[1] < 90){
most_frequent_assignment <- NA;
}
}
output_df[N, ] <- c(gene_cluster,
most_frequent_assignment,
df$pct_alignment[1],
df$cov_in_003[1],
df$cov_in_007[1],
df$cov_in_011[1],
df$cov_in_012[1])
N = N + 1
}
# store the output data frame as a TAB delimited file so we can import it back
# into the anvi'o additional item data table
write.table(output_df, "gene_clusters_additional_data.txt", quote=FALSE, sep="\t", na="", row.names=FALSE)
Running this R script generates a new file, gene_clusters_additional_data.txt, which has this format that is compatible with the anvi’o framework for additional data tables:
gene_cluster_id | WO_assignmnet | pct_alignment | cov_in_003 | cov_in_007 | cov_in_011 | cov_in_012 |
---|---|---|---|---|---|---|
GC_00000001 | WOPip1 | 99.8308 | 292.142949713 | 278.8739736035 | 438.057412722333 | 141.646721645067 |
GC_00000002 | WOPip1 | 99.8855 | ||||
GC_00000003 | WOPip1 | 99.8761 | 709.72681333425 | 628.11843941625 | 1261.49866029432 | 463.770814102825 |
GC_00000004 | WOPip2 | 99.9831 | 293.0934284405 | 292.01863000825 | 590.103568904 | 206.31974780075 |
GC_00000005 | WOPip1 | 395.405748663 | 355.5891902215 | 745.3106378915 | 248.0930258465 | |
GC_00000006 | 258.4480680715 | 210.105256783 | 468.9943071705 | 150.1461058625 | ||
GC_00000007 | 38.5289 | 177.700485876275 | 179.2137872846 | 368.3048043785 | 129.04068234665 | |
GC_00000008 | 96.469323213275 | 90.5922675522 | 193.55186590775 | 84.19663187855 | ||
GC_00000009 | WOPip1 | 99.9295 | 424.62004662 | 398.405594406 | 908.107226107 | 282.895104895 |
GC_00000010 | 22.2944 | 159.644804793775 | 176.11177187325 | 273.500273464 | 112.942770477667 | |
GC_00000011 | 225.752690422 | 244.848198573 | 471.335971987667 | 162.687675610667 | ||
GC_00000012 | WOPip1 | 95.977 | 1039.549962921 | 862.496477567 | 1775.790507975 | 814.773359288 |
GC_00000013 | WOPip1 | 99.8904 | 507.575734139667 | 482.993741871667 | 1040.484991505 | 398.882894950333 |
GC_00000014 | WOPip4 | 99.9328 | 124.619265233 | 143.4063620075 | 285.439516129 | 109.7536738349 |
GC_00000015 | WOPip1 | 99.8788 | 714.0546127935 | 648.0166105485 | 1600.50512907 | 609.646700337 |
GC_00000016 | WOPip5 | 99.7268 | 442.734972678 | 426.12568306 | 802.721311475 | 310.756830601 |
GC_00000017 | WOPip4 | 99.8978 | 268.2677769445 | 264.7108147735 | 641.819649656 | 207.1309664715 |
GC_00000018 | WOPip4 | 99.8538 | 505.152685391967 | 638.537152008667 | 1442.48465139133 | 553.198915989333 |
GC_00000019 | ||||||
GC_00000020 | WOPip1 | 99.8004 | 684.9813806125 | 627.415566457 | 1579.113447805 | 598.094786331 |
GC_00000021 | WOPip1 | 99.7849 | 307.589886039667 | 283.638817663667 | 619.110042735 | 239.1747863247 |
GC_00000022 | WOPip1 | 99.8505 | 284.871449925 | 256.569506726 | 894.059790732 | 406.463378176 |
GC_00000023 | 245.821474616 | 224.2942178115 | 528.8888686175 | 182.1940896965 | ||
GC_00000024 | 272.153092006 | 234.3304282645 | 509.272853421 | 179.73128845 | ||
(…) | (…) | (…) | (…) | (…) | (…) | (…) |
Importing additional data into the pangenome and finalizing it
We the newly generated additional data into our pangenome:
anvi-import-misc-data gene_clusters_additional_datax.txt \
-p Wolbachia-Pan/Wolbachia-Pan-PAN.db \
--target-data-table items
And visualized the pangenome again with all the new additional data layers we just imported:
anvi-display-pan -p Wolbachia-Pan/Wolbachia-Pan-PAN.db \
-g WOLBACHIA-GENOMES.db
We then manually organized the order of layers, changed their colors (since we believe it is important to polish complex visualizations to help the viewer), and made selections of gene clusters based on their distribution patterns across our five genomes to achieve this display:
We stored our visual language as an anvi’o state file, and selections as an anvi’o collection and collection info files. These files can be imported into a blank pan profile database with programs anvi-import-state
and anvi-import-collection
.
As our main figure, we used the “forced synteny” order for our gene clusters to show them in the context of the wPip reference genome (by selecting Forced synteny <> wPip:NA:NA
from the “Items Order” combo box in the main tab of the interactive interface):
DATA doi:10.6084/m9.figshare.6291650 serves the anvi’o databases to reprduce the figure above and to investigate the pangenome.
A note on reproducibility
If you were to download the pangenome from the archive above, you could use the program anvi-display-pan
on a computer with anvi’o v5
installation, to acquire an interactive display and inspect the pangnome interactively.
Here is a very quick demonstration you can watch in full-screen:
The archive file also includes an additional directory that contains the static summary of the pangenome along with the gene cluster membership information so you can access every amino acid sequence for every gene.
Yes. We also are very happy that it is now 2018 and we have all these tools to share our science.
Computing the average nucleotide identity between Wolbachia genomes
For this, we first generated anvi’o contigs databases for each of our Wolbachia MAG using the previously generated summary outputs:
for s in O03 O07 O11 O12
do
anvi-gen-congtigs-database -f 06_MERGED/Culex_"$s"/SUMMARY/bin_by_bin/Wolbachia/Wolbachia-contigs.fa \
-o Wolbachia_"$s".db
done
Created an external genomes file:
name | contigs_db_path |
---|---|
Wolbachia_O03 | Wolbachia_O03.db |
Wolbachia_O07 | Wolbachia_O07.db |
Wolbachia_O11 | Wolbachia_O11.db |
Wolbachia_O12 | Wolbachia_O12.db |
wPip | wPip.db |
And run anvi-compute-ani
, which uses PyANI by Pritchard et al:
anvi-compute-ani --external-genomes genomes-for-ani.txt \
--output-dir ANI \
--num-threads 8
If you are using anvio v6 or later, anvi-compute-ani
has been replaced by anvi-compute-genome-similarity
. The above command remains the same otherwise.
Which returned the following results for percent identity:
Wolbachia_O03 | Wolbachia_O07 | Wolbachia_O11 | Wolbachia_O12 | wPip | |
---|---|---|---|---|---|
Wolbachia_O03 | 100.00% | 99.95% | 99.85% | 99.90% | 99.79% |
Wolbachia_O07 | 99.97% | 100.00% | 99.89% | 99.92% | 99.78% |
Wolbachia_O11 | 99.93% | 99.94% | 100.00% | 99.98% | 99.85% |
Wolbachia_O12 | 99.85% | 99.82% | 99.86% | 100.00% | 99.72% |
wPip | 99.26% | 99.30% | 99.27% | 99.10% | 100.00% |
And alignment coverage:
Wolbachia_O03 | Wolbachia_O07 | Wolbachia_O11 | Wolbachia_O12 | wPip | |
---|---|---|---|---|---|
Wolbachia_O03 | 100.00% | 99.46% | 96.62% | 98.10% | 99.04% |
Wolbachia_O07 | 91.56% | 100.00% | 89.62% | 90.95% | 91.73% |
Wolbachia_O11 | 97.04% | 97.71% | 100.00% | 98.90% | 98.48% |
Wolbachia_O12 | 96.30% | 96.96% | 96.54% | 100.00% | 98.49% |
wPip | 81.39% | 81.71% | 80.50% | 83.40% | 100.00% |
Computing the density of single-nucleotide variants for Wolbachia populations
To what extent the Wolbachia genomes we recovered represents mono-clonal populations? Or is there a great deal of diversity of Wolbachia populations in these metagenomes? To address these questions we generated single-nucleotide variant profiles for Wolbachia genomes using metagenomes from which they were reconstructed:
# generate variability profiles for each Wolbachia genome
for s in O03 O07 O11 O12
do
echo Culex_"$s" > sample-of-interest.txt
anvi-gen-variability-profile -p 06_MERGED/Culex_"$s"/PROFILE.db \
-c 03_CONTIGS/Culex_"$s"-contigs.db \
-C FINAL \
-b Wolbachia \
--samples-of-interest sample-of-interest.txt \
-o variability_"$s".txt
done
# combine all variability information into a single file:
head -n 1 variability_003.txt > variability-all.txt
cat variability_0* | grep -v entry_id >> variability-all.txt
# generate a text file with genome lenghts of Wolbachia and samples
# from which they were generated:
echo -e "metagenome\tgenome_length" > genome-lengths.txt
for s in O03 O07 O11 O12
do
len=`sqlite3 Wolbachia_"$s".db 'select value from self where key=="total_length";'`
echo -e "Culex_$s\t$len" >> genome-lengths.txt
done
The following R script, variability-analysis.R, calculated the percentage of SNVs per Wolbachia genome using the variability-all.txt and genome-lengths.txt files:
#!/usr/bin/env Rscript
suppressMessages(library("dplyr"))
# read in the genome lenghts per Wolbachia genome
genome_lengths <- read.table(file = 'genome-lengths.txt',
header = TRUE,
sep = "\t",
quote = "")
# read in the concatenated variability profiles for each
# of the four metagenomes generated by the anvi'o program
# anvi-gen-variability-profile
variability <- read.table(file = 'variability-all.txt',
header = TRUE,
sep = "\t",
quote = "")
# keep only single-nucleotide variants that fall into the
# context of complete gene calls to avoid dubious contig
# ends where variants often emerge due to mappig artifacts
df <- variability[variability$in_complete_gene_call == "1" &
variability$departure_from_consensus > 0.1, ]
# count how many SNVs you have per metagenome
num_snvs <- as.data.frame(table(df$sample_id))
names(num_snvs) <- c('metagenome', 'num_snvs')
# merge it with genome lengths
snvs=full_join(num_snvs, genome_lengths, by="metagenome")
# calculate the percent SNVs per metagenome
snvs$percent_snvs <- snvs$num_snvs / snvs$genome_length * 100
# add position info
snvs$in_first <- 0
snvs$in_second <- 0
snvs$in_third <- 0
for(mg in snvs$metagenome){
arr <- df[df$sample_id == mg, ]$base_pos_in_codon
snvs[snvs$metagenome == mg, ]$in_first <- length(arr[arr == 1]) * 100 / length(arr)
snvs[snvs$metagenome == mg, ]$in_second <- length(arr[arr == 2]) * 100 / length(arr)
snvs[snvs$metagenome == mg, ]$in_third <- length(arr[arr == 3]) * 100 / length(arr)
}
# print it nicely
cat(sprintf("
=======================================================================
Percent SNV density for each Wolbachia genome in metagenomes from which
they were recovered, the average coverage of nucleotide positions at which
SNVs were observed, and the distribution of SNVs across codon positions
=======================================================================\n"))
for(mg in snvs$metagenome){
cat(sprintf("Wolbachia in %s: %.2f%% (%d SNVs with an avg. cov. of %dX;
%.1f%% in 1st,
%.1f%% in 2nd,
%.1f%% in 3rd nt pos).\n\n",
mg,
snvs[snvs$metagenome == mg, ]$percent_snvs,
snvs[snvs$metagenome == mg, ]$num_snvs,
round(mean(df[df$sample_id == mg, ]$coverage)),
snvs[snvs$metagenome == mg, ]$in_first,
snvs[snvs$metagenome == mg, ]$in_second,
snvs[snvs$metagenome == mg, ]$in_third))
}
Which offers this output when run:
$ ./variability-analysis.R
=======================================================================
Percent SNV density for each Wolbachia genome in metagenomes from which
they were recovered, the average coverage of nucleotide positions at which
SNVs were observed, and the distribution of SNVs across codon positions
=======================================================================
Wolbachia in Culex_O03: 0.01% (168 SNVs with an avg. cov. of 419X;
31.0% in 1st,
23.8% in 2nd,
45.2% in 3rd nt pos).
Wolbachia in Culex_O07: 0.05% (612 SNVs with an avg. cov. of 519X;
31.7% in 1st,
25.5% in 2nd,
42.8% in 3rd nt pos).
Wolbachia in Culex_O11: 0.02% (203 SNVs with an avg. cov. of 835X;
31.0% in 1st,
22.7% in 2nd,
46.3% in 3rd nt pos).
Wolbachia in Culex_O12: 0.02% (257 SNVs with an avg. cov. of 303X;
28.4% in 1st,
20.6% in 2nd,
51.0% in 3rd nt pos).
Please feel free to get in touch with us if you have any questions or need help resolve poorly explained parts via julie.reveillaud [at] cirad.fr or meren [at] uchicago.edu.