Microbial 'omics

Public Data and Reproducible Bioinformatics

This page serves the publicly available data items mentioned in our publications. Please do not hesitate to get in touch if something is missing.

A Wolbachia Plasmid

The Wolbachia mobilome in Culex pipiens includes a putative plasmid.

Reveillaud, J., Bordenstein, S. R., Cruaud, C., Shaiber, A., Esen, Ö. C., Weill, M., Makoundou, P., Lolans, K., Watson, A. R., Rakotoarivony, I., Bordenstein, S. R., and Eren, A. M.
- The first report of a Wolbachia plasmid through genome-resolved metagenomics on microsurgically removed individual mosquito ovary samples (peer reviews and responses).
- Yet another application of metapangenomics and an application of MinION long-read sequencing on extremely low-biomass samples.
- Reproducible bioinformatics workflow with all data items, and a 'behind the paper' blog post by Julie Reveillaud.
Nature Communications. 10:1051

http://merenlab.org/data/wolbachia-plasmid gives access to our complete bioinformatics workflow.

Public data items for the study:

The Prochlorococcus Metapangenome

Linking pangenomes and metagenomes: the Prochlorococcus metapangenome.

Delmont, T. O., and Eren, A. M.
- A big-data study in which a pangenome of 31 Prochlorococcus isolates meets 31 billion Tara Oceans metagenomic sequences (Peer-review history).
- Metapangenomes reveal to what extent genes that may be linked to the ecology and fitness of microbes are conserved within a phylogenetic clade.
- Reproducible bioinformatics workflow.
PeerJ. 6:e4320

http://merenlab.org/data/prochlorococcus-metapangenome gives access to our complete bioinformatics workflow.

Public data items for the study:

Genomes from Tara Oceans Metagenomes

Nitrogen-fixing populations of Planctomycetes and Proteobacteria are abundant in the surface ocean metagenomes.

Delmont, T. O., Quince, C., Shaiber, A., Esen, Ö. C., Lee, S. T. M., Rappé, M. S., McLellan, S. L., Lücker, S., and Eren, A. M.
- First genomic evidence for abundant and widespread non-cyanobacterial nitrogen-fixing populations in the surface ocean.
- Nearly 1,000 non-redundant, high-quality bacterial, archaeal, and eukaryotic population genomes from TARA Oceans metagenomes.
- A "behind the paper" blog post by Meren, a press release by the MBL, and an extensive description of the bioinformatics workflow.
Nature Microbiology. 3:804–813

http://merenlab.org/data/tara-oceans-mags gives access to our complete bioinformatics workflow.

Public data items for the study:

  • The original TARA Oceans metagenomes are available through the European Bioinformatics Institute (ERP001736) and NCBI (PRJEB1787).
  • doi:10.6084/m9.figshare.4902920: Our raw assembly outputs per region.
  • doi:10.6084/m9.figshare.4902917: All amino acid sequences in our raw assemblies.
  • doi:10.6084/m9.figshare.4902923: FASTA files for 957 non-redundant metagenome-assembled genomes.
  • doi:10.6084/m9.figshare.4902941: Self-contained anvi’o profiles for each non-redundant MAG (each of which can be visualized interactively and offline through the program anvi-interactive).
  • doi:10.6084/m9.figshare.4902926: A static HTML output for the anvi’o merged profile database for non-redundant MAGs (double-click the index.html file after download).
  • doi:10.6084/m9.figshare.4902938: Main and Supplementary Tables and Figures, which include:
    • Figure 1: geographically bounded metagenomic co-assemblies.
    • Figure 2: the nexus between phylogeny and function of HBDs.
    • Figure 3: phylogeny of nitrogen fixation genes.
    • Figure 4: the abundance of nitrogen-fixing populations of Planctomycetes and Proteobacteria across oceans.
    • Supplementary Figure 1: phylogenetic analysis of nifH genes.
    • Supplementary Table 1: the summary of the 93 metagenomes from TARA Oceans, and the twelve geographic regions they represent.
    • Supplementary Table 2: the summary of the co-assembly and binning outputs for each metagenomic set.
    • Supplementary Table 3: genomic features of 957 MAGs from the non-redundant genomic database, including the taxonomy for each MAG, and the mean coverage, relative distribution, detection, and number of recruited reads for each MAG across the 93 metagenomes, etc.
    • Supplementary Table 4: the 16S rRNA gene sequence identified in HBD-09.
    • Supplementary Table 5: genomic features, Pearson correlation (based on the relative distribution in 93 metagenomes).
    • Supplementary Table 6: RAST subsystems and KEGG modules for the nine HBDs.
    • Supplementary Table 7: nifH gene sequences in MAGs, orphan scaffolds, as well as the reference sequence γ-24774A11, along with their mean coverage across the 93 metagenomes.
    • Supplementary Table 8: genomic features of 30,244 bins manually characterized from the 12 metagenomic sets. Completion and redundancy estimates are based on the average of four bacterial single-copy gene collections.
    • Supplementary Table 9: KEGG annotation for 1,077 MAGs.
    • Supplementary Table 10: relative distribution of 1,077 MAGs across the 93 metagenomes.
  • An interactive visualization for the phylogenomic analysis of 432 Proteobacteria and 43 Planctomycetes metagenome-assembled genomes from our database of 957 non-redundant MAGs is also available: here.
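
Each figshare item in the list above is identified by a DOI. As a minimal sketch, assuming the standard https://doi.org/ resolver, a DOI copied from this page can be turned into a URL to open in a browser. The helper name doi_url is hypothetical and is not part of anvi’o or figshare.

```shell
# Hypothetical helper: build a resolver URL from a DOI as written on this page.
# Accepts the DOI with or without the leading "doi:" prefix.
doi_url() {
    # ${1#doi:} strips an optional "doi:" prefix from the first argument
    printf 'https://doi.org/%s\n' "${1#doi:}"
}

# Example: the FASTA files for the 957 non-redundant MAGs
doi_url "doi:10.6084/m9.figshare.4902923"
```

The resolver redirects to the figshare landing page for the item, where the individual files can be downloaded.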

Genome-resolved Fecal Microbiota Transplantation

Tracking microbial colonization in fecal microbiota transplantation experiments via genome-resolved metagenomics.

Lee, S. T. M., Kahn, S. A., Delmont, T. O., Shaiber, A., Esen, Ö. C., Hubert, N. A., Morrison, H. G., Antonopoulos, D. A., Rubin, D. T., and Eren, A. M.
- An FMT study with metagenome-assembled genomes (see public data).
- Bacteroidales: high colonization rate. Clostridiales: low colonization rate. Colonization success is negatively correlated with the number of genes related to sporulation.
- MAGs with the same taxonomy showed different colonization properties, highlighting the importance of high-resolution analyses.
- Populations that colonized both recipients were also prevalent in the HMP cohort (and the ones that did not were distributed sporadically across the HMP cohort).
Microbiome. 5:50

Public data items for the study:

  • While the accession ID SRP093449 serves the raw shotgun metagenomic data through NCBI’s Sequence Read Archive, this FigShare collection gives access to all public data items detailed below.
  • doi:10.6084/m9.figshare.4792633: Files for the anvi’o interactive interface in manual mode, for a quick visualization of the distribution of the 92 donor MAGs from the Lee, Kahn, et al. study (panel a in the figure above). Follow these steps for a quick interactive visualization:
# download the file
wget https://ndownloader.figshare.com/files/7879036 -O ANVIO-FMT-D-R01-R02-QUICK-VISUALIZATION.tar.gz

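# unpack
tar -zxvf ANVIO-FMT-D-R01-R02-QUICK-VISUALIZATION.tar.gz

# go into the directory (name assumed to match the archive; adjust if it differs)
cd ANVIO-FMT-D-R01-R02-QUICK-VISUALIZATION

# run anvi-interactive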
anvi-interactive -p profile.db \
                 -s samples.db \
                 -t tree.txt \
                 -d data.txt \
                 --manual \
                 --title "The distribution of 92 donor MAGs"
  • doi:10.6084/m9.figshare.4792627: A static HTML output for the anvi’o merged profile database for the 92 donor MAGs. This static web site contains FASTA files for MAGs, coverage and detection values, functional annotations, and other essential information.
  • doi:10.6084/m9.figshare.4792621: The full anvi’o merged profile and contigs databases for the 92 donor MAGs (which give access, for each MAG, to the detailed information shown in panels b and c of the figure above). Among many other things you can do with these anvi’o files, you can use the program anvi-refine to study any MAG in the study in detail. MAGs are described in a collection named MAGs. Here is a quick example:
# download the merged anvi'o profile for the study
wget https://ndownloader.figshare.com/files/7879024 -O ANVIO-FMT-D-R01-R02-MERGED-PROFILE.tar.gz

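# unpack the archive, and get into the directory
# (directory name assumed to match the archive; adjust if it differs)
tar -zxvf ANVIO-FMT-D-R01-R02-MERGED-PROFILE.tar.gz
cd ANVIO-FMT-D-R01-R02-MERGED-PROFILE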

# take a look at the MAGs stored in the profile
anvi-script-get-collection-info -p PROFILE.db -c CONTIGS.db -C MAGs
Auxiliary Data ...............................: Found: CONTIGS.h5 (v. 1)
Contigs DB ...................................: Initialized: CONTIGS.db (v. 8)

Bins in collection "MAGs"
FMT-Donor_MAG_00027 :: PC: 95.98%, PR: 3.30%, N: 129, S: 1,914,574, D: bacteria (0.99)
FMT-Donor_MAG_00069 :: PC: 82.93%, PR: 7.68%, N: 154, S: 1,168,783, D: bacteria (0.91)
FMT-Donor_MAG_00025 :: PC: 96.34%, PR: 3.99%, N: 139, S: 2,740,980, D: bacteria (1.00)
FMT-Donor_MAG_00068 :: PC: 76.07%, PR: 0.72%, N: 109, S: 1,012,764, D: bacteria (0.77)
FMT-Donor_MAG_00041 :: PC: 89.80%, PR: 3.63%, N: 282, S: 2,471,626, D: bacteria (0.93)
FMT-Donor_MAG_00010 :: PC: 100.00%, PR: 4.38%, N: 224, S: 3,125,246, D: bacteria (1.04)

# pick one, and visualize it interactively
anvi-refine -p PROFILE.db -c CONTIGS.db -C MAGs -b FMT-Donor_MAG_00054


  • doi:10.6084/m9.figshare.4793761: Individual anvi’o profiles for the occurrence of each FMT donor MAG across 151 HMP gut metagenomes (which give access, for each MAG, to the information shown in panel d of the figure above).
  • doi:10.6084/m9.figshare.4792645: Individual figures that show the detection of 92 donor MAGs in 151 HMP gut metagenomes.
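
The anvi-script-get-collection-info listing above prints one line per bin. The sketch below filters such lines to shortlist MAGs before opening one with anvi-refine. It assumes that PC and PR stand for percent completion and percent redundancy, and that the line layout matches the listing shown on this page. The helper name filter_bins and the thresholds (90 and 5) are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper: print the names of bins whose PC value is at least $1
# and whose PR value is at most $2. Reads listing lines on standard input.
filter_bins() {
    awk -v min_pc="$1" -v max_pr="$2" '
        # only consider lines shaped like "NAME :: PC: 95.98%, PR: 3.30%, ..."
        $2 == "::" && $3 == "PC:" && $5 == "PR:" {
            pc = $4 + 0    # "95.98%," becomes 95.98 (awk keeps the leading number)
            pr = $6 + 0
            if (pc >= min_pc && pr <= max_pr) print $1
        }'
}

# Example with three lines copied from the listing above
filter_bins 90 5 <<'EOF'
FMT-Donor_MAG_00027 :: PC: 95.98%, PR: 3.30%, N: 129, S: 1,914,574, D: bacteria (0.99)
FMT-Donor_MAG_00069 :: PC: 82.93%, PR: 7.68%, N: 154, S: 1,168,783, D: bacteria (0.91)
FMT-Donor_MAG_00025 :: PC: 96.34%, PR: 3.99%, N: 139, S: 2,740,980, D: bacteria (1.00)
EOF
```

With these sample lines the helper prints FMT-Donor_MAG_00027 and FMT-Donor_MAG_00025. To use it on real output, save the listing to a file first and redirect that file into filter_bins, since the stream the program writes to may differ between anvi’o versions.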

Bacteroides in Pouchitis

Patient-specific Bacteroides genome variants in pouchitis.

Vineis, J. H., Ringus, D. L., Morrison, H. G., Delmont, T. O., Dalal, S., Raffals, L. H., Antonopoulos, D. A., Rubin, D. T., Eren, A. M., Chang, E. B., and Sogin, M. L.
mBio. 7(6):e01713-16

Here is a blog post on it: Bacteroides Genome Variants, and a reproducible science exercise with anvi’o.

Data for the paper:

The anvi’o profiles article in this data collection contains 22 items:

For an example on how to re-analyze these anvi’o profiles, please click here.

Tardigrade Assembly Re-analysis

Identifying contamination with advanced visualization and analysis practices: metagenomic approaches for eukaryotic genome assemblies.

Delmont, T. O., and Eren, A. M.
- A holistic approach to visualize and curate genomic and metagenomic assemblies.
- A re-analysis of the first released Tardigrade genome reveals a likely symbiont among other contaminants.
- A practical approach to estimate the number of bacterial genomes in an assembly.
PeerJ. 4:e1839
  • This link will download everything necessary to recreate an unpolished version of Figure 2 as it appears in the manuscript, including the run script that will run the process automatically (compatible with both the v1 and v2 branches of anvi’o).
  • The following links give access to media files and supplementary tables:
    • Figure 1. Holistic assessment of the tardigrade genome release from Boothby et al. (2015). The dendrogram in the center organizes scaffolds based on sequence composition and coverage values in data from 11 DNA libraries. Scaffolds larger than 40 kbp were split into sections of 20 kbp for visualization purposes. Splits are displayed in the first inner circle and GC-content (0-71%) in the second circle. In the following 11 layers, each bar represents the portion of scaffolds covered by short reads in a given sample. The next layer shows the same information for RNA-Seq data. Scaffolds harboring genes used by Boothby et al. to support the expanded HGT hypothesis are shown in the next layer. Finally, the outermost layer shows our selections of scaffolds as draft genome bins: the curated tardigrade genome (selection number 1), as well as three near-complete bacterial genomes originating from various contamination sources (selection numbers 2, 3, and 4).
    • Figure 2. Occurrence of the 139 bacterial single-copy genes reported by Campbell et al. (2013) across scaffold collections. The top two plots display the frequency and distribution of single-copy genes in the raw tardigrade genomic assemblies generated by Boothby et al. (2015) and Koutsovoulos et al. (2015), respectively. The bottom two plots display the same information for each of the curated tardigrade genomes. Each bar represents the square-root normalized number of significant hits per single-copy gene. The same information is visualized as box-plots on the left side of each plot.
    • Supplementary Figure 1. Visualization and curation of the raw tardigrade genome assembly from Koutsovoulos et al. (2015). In the left panel (curation step I), 24,841 scaffolds that were longer than 1 kbp from the raw assembly were clustered based on sequence composition and coverage values in data from the two Illumina sequencing libraries (the inner dendrogram). Scaffolds longer than 40 kbp were split into sections of 20 kbp for visualization purposes. The second layer shows the GC-content for each scaffold. The next two layers represent the log-normalized mean coverage values for scaffolds in the two sequencing datasets. Finally, our scaffold selections (tardigrade draft 01 and six bacterial draft genomes) are displayed in the outer layer. In the right panel (curation step II), the 15,839 scaffolds from the tardigrade selection from step I were clustered based on sequence composition only for a more precise curation. Additional scaffold selections (tardigrade draft 02 and two bacterial draft genomes) are displayed in the outer layer.
    • Supplementary Table 1. Summary of H. dujardini and bacterial genomes identified from the raw assembly results of Boothby et al. (2015) and Koutsovoulos et al. (2015). * Inferred from Boothby et al. (2015) and Koutsovoulos et al. (2015) publications. ** Scores were calculated using bacterial single-copy genes from Campbell et al. (2013) and are only used to assess bacterial contamination levels in the eukaryotic assembly results.
    • Supplementary Table 2. Summary of functions identified by RAST in the bacterial draft genome #2 (selection #3 in Fig. 1).
    • Supplementary Table 3. Summary of HMM hits for each bacterial single-copy gene (collection of 139 from Campbell et al. (2013)) identified in 1) the raw assembly by Boothby et al. (2015), 2) the raw assembly by Koutsovoulos et al. (2015), 3) the curated draft genome of Hypsibius dujardini from Boothby et al. assembly in this study, and 4) the curated draft genome of H. dujardini from Koutsovoulos et al. (2015).

Everything mentioned on this page can be cited using doi 10.6084/m9.figshare.2067057.

Anvi’o Methods Paper

Anvi’o: An advanced analysis and visualization platform for ‘omics data.

Eren, A. M., Esen, Ö. C., Quince, C., Vineis, J. H., Morrison, H. G., Sogin, M. L., and Delmont, T. O.
- The methods paper for anvi'o.
- Binning, and single-nucleotide variant analysis of a human gut time series metagenome.
- Re-analysis of cultivar genomes, metagenomes, and metatranscriptomes associated with the Deepwater Horizon oil spill.
PeerJ. 3:e1319

The anvi’o profiles here will only run with a much earlier version of anvi’o. If you would like to work with them, please check out your anvi’o codebase at this commit. Please don’t hesitate to write to us if you need assistance.

Daily Infant Gut Samples by Sharon et al. Raw data and anvi’o results for the section on supervised binning and the analysis of the variability in genome bins.

Pensacola Beach Samples by Overholt et al. and Rodriguez-R et al. Raw data and anvi’o results for the section on linking cultivar genomes with metagenomes.

  • While this address gives access to the anvi’o summary of the ten cultivar genomes (download), this one serves the 56 metagenomic bins (download) shown in Figure 4.
  • You can download the output of anvi-merge for the mapping of metagenomes to Overholt cultivars from here, and the output of anvi-merge for metagenomic bins is available here.

Gulf of Mexico Samples by Mason et al. and Yergeau et al. Results for the section on linking metagenomes, metatranscriptomes, and single-cell genomes.

Media and Supplementary files.