Table of Contents
- Downloading and characterizing Kowarsky et al. contigs
- Sequences of ribosomal proteins
- Searching ribosomal proteins in NCBI’s nr database
- Some speculations
- Hypothesis: Oral cavity may be the actual reservoir for this CPR (Concluision: Not likely)
- Clifford Beall reports back: no hits from 20 saliva metagenomes
- Meren confirms: no hits from 14 plaque and tongue metagenomes
- Functions
- Can we say it though?
Twitter is bad. I mostly follow scientists, and often end up running into interesting findings from other groups that make me want to take a quick look at their data. Although most of our procrastinations don’t end up on the blog, sometimes they do: 1, 2, 3. Well, today was one of those days.
The study above reveals microbial findings from the human blood with a potential to start some debates about contamination.
Briefly, Kowarsky et al. take hundreds of blood samples collected from tens of patients, and use shotgun seqeuncing and assembly strategies to recover contigs from cell-free DNA. They remove sequences that match to the human genome, and investigate what is there in the remaining contigs. The authors also validate some of their findings by performing independent bench experiments, which is very nice to see since unfortunately ‘omics findings are rarely validated by additional experiments.
While you may be asking yourself whether some of the findings may have been driven by contamination, I am looking at the authors’ use of the term “dark matter” in their manucript and ask myself when should I move to the countryside and raise chickens. But that’s just me.
Interesting findings in this study that are summarized in its title intrigued me to take another look at their assembly. I was very glad to learn that the authors did make available not only the short reads for metagenomes, but also their assembly results. I thank them very much for their observance of open science practices (which is something that is also not as common as it should be).
Downloading and characterizing Kowarsky et al. contigs
I started by downloading the assembled contigs, and created an anvi’o contigs database. Everything took about 2-3 minutes:
# download the assembled contigs
wget http://www.pnas.org/content/suppl/2017/08/21/1707009114.DCSupplemental/pnas.1707009114.sd05.txt -O Kowarsky_et_al.fa
# generate an anvi'o contigs database
anvi-gen-contigs-database -f Kowarsky_et_al.fa \
-o Kowarsky_et_al.db \
--name 'Kowarsky et al contigs'
# run anvi'o's default HMM profiles on this contigs database
# using 10 cores
anvi-run-hmms -c Kowarsky_et_al.db -T 10
The output from the anvi-run-hmms
step showed 0 rRNA gene hits, which was in agreement with the authors’ statement in the paper. But the same output indicated that there were some bacterial single-copy core genes in the assembly.
Whenever there is a new contigs database, I take a quick look at the occurrence of the number of bacterial and archaeal single-copy core genes, since this is a great way to roughly estimate the number of genomes one should expect to find in an assembly.
Running these two commands,
anvi-script-gen_stats_for_single_copy_genes.py Kowarsky_et_al.db
anvi-script-gen_stats_for_single_copy_genes.R Kowarsky_et_al.db.hits Kowarsky_et_al.db.genes
tells me that the assembly does not contain any highly complete bacterial genomes:
Fine.
But a closer inspection via thr good’ol squinting the eyes technique, reveals the occurrence of a good number of Ribosomal proteins in these contigs (which is also pointed out by the authors in their paper).
Sequences of ribosomal proteins
Next, I ask anvi’o to give me the amino acid sequences of all ribosomal proteins:
# first list all the available gene names in Campbell et al.
# single-copy core gene collection (the output is not shown):
anvi-get-sequences-for-hmm-hits -c Kowarsky_et_al.db \
--hmm-source Campbell_et_al \
--list-available-gene-names
# and from there I select all the ribosomal genes, and ask
# for their amino acid seqeunces to be stored in
anvi-get-sequences-for-hmm-hits -c Kowarsky_et_al.db \
--hmm-source Campbell_et_al \
-o Kowarsky_et_al_SCG_hits.fa \
--get-aa-sequences \
--gene-names "Ribosom_S12_S23, Ribosomal_L1, Ribosomal_L10, Ribosomal_L11, Ribosomal_L11_N, \
Ribosomal_L12, Ribosomal_L13, Ribosomal_L14, Ribosomal_L16, Ribosomal_L17, \
Ribosomal_L18e, Ribosomal_L18p, Ribosomal_L19, Ribosomal_L2, Ribosomal_L20, \
Ribosomal_L21p, Ribosomal_L22, Ribosomal_L23, Ribosomal_L27, Ribosomal_L28, \
Ribosomal_L29, Ribosomal_L2_C, Ribosomal_L3, Ribosomal_L32p, Ribosomal_L35p, \
Ribosomal_L4, Ribosomal_L5, Ribosomal_L5_C, Ribosomal_L6, Ribosomal_L9_C, \
Ribosomal_L9_N, Ribosomal_S10, Ribosomal_S11, Ribosomal_S13, Ribosomal_S15, \
Ribosomal_S16, Ribosomal_S17, Ribosomal_S18, Ribosomal_S19, Ribosomal_S2, \
Ribosomal_S20p, Ribosomal_S3_C, Ribosomal_S4, Ribosomal_S5, Ribosomal_S5_C, \
Ribosomal_S6, Ribosomal_S7"
The resulting FASTA file is here. But I summarized the output for you a tiny bit:
Found in the assembly | e-value | Contig | Hit start | Hit stop |
---|---|---|---|---|
Ribosomal_L20 | 5.5e-36 | PR_node_48 | 6269 | 6617 |
Ribosomal_L32p | 1.1e-09 | PR_node_60 | 1388 | 1538 |
Ribosomal_L9_N | 4.7e-18 | PR_node_197 | 1959 | 2403 |
Ribosomal_L27 | 7.3e-34 | PR_node_197 | 2470 | 2749 |
Ribosomal_S16 | 8.7e-15 | PR_node_199 | 1751 | 2105 |
Ribosomal_L13 | 1.4e-33 | PR_node_260 | 1670 | 2027 |
Ribosomal_L17 | 1.9e-36 | PR_node_260 | 2032 | 2383 |
Ribosomal_L13 | 7.3e-32 | PR_node_287 | 1680 | 2406 |
Ribosomal_S2 | 3.8e-62 | PR_node_296 | 1014 | 1740 |
Ribosomal_L21p | 4.2e-28 | PR_node_340 | 1693 | 2065 |
Ribosomal_L2 | 1.1e-28 | PR_node_444 | 1 | 280 |
Ribosomal_L23 | 1.3e-21 | PR_node_444 | 325 | 619 |
Ribosomal_L4 | 1.1e-60 | PR_node_444 | 628 | 1270 |
Ribosomal_S7 | 8.3e-56 | PR_node_1196 | 0 | 477 |
Ribosomal_S20p | 7.7e-15 | PR_node_1285 | 146 | 410 |
Ribosomal_L10 | 8.6e-23 | PR_node_1290 | 148 | 637 |
Ribosomal_L14 | 5.1e-30 | HT_node_2492 | 11 | 407 |
Ribosomal_S13 | 2.9e-25 | LT_node_1007 | 1 | 250 |
Authors mention that they found ribosomal proteins in 14 contigs, which perfectly matches to the table above: there are 14 distinct contigs that carry ribosomal proteins in my results as well (it is always a great feeling to find perfect agreement with published material).
As you can see from this output, multiple of these hits are occurring in the same contig (i.e., PR_node_260 contains two ribosomal proteins one after another, etc), however, we don’t know whether any of these distinct contigs are coming from the same population.
The supplementary methods suggest that these contigs are assembled independently from each dataset. As a final note on this, in Kowarsky et al.’s notation in the study PR, HT, and LT stand for samples originate from ‘pregnant’, ‘heart transplant’, and ‘lung transplant’ individuals.
So one can say most of the contigs with ribosomal proteins are coming from blood samples of pregnant individuals.
Fine.
Since there isn’t much information on the taxonomic origins of these bacterial hits in Kowarsky et al.’s study, I decided to focus on these ribosomal proteins a bit more.
Searching ribosomal proteins in NCBI’s nr database
I simply went to the NCBI and used blastp
to search those amino acid seqeunces in the database of non-redundant protein sequences.
Once the search was done, I clicked the Download
menu, and selected Multiple-file JSON
option.
I had never tried to make sense of JSON files form NCBI BLAST output before, so I quickly (and dirtily) wrote the following Python program to get the best hit for each query sequence:
#!/usr/bin/env python
import sys
import json
import glob
# poor man's whatever:
QUERY = lambda: hits['BlastOutput2']['report']['results']['search']['query_title'].split('___')[0]
QLEN = lambda: hits['BlastOutput2']['report']['results']['search']['query_len']
HIT = lambda: hits['BlastOutput2']['report']['results']['search']['hits'][index]
DESC = lambda: HIT()['description'][index]
TITLE = lambda: DESC()['title']
SCINAME = lambda: DESC()['sciname']
ACC = lambda: '[%(desc)s](https://www.ncbi.nlm.nih.gov/protein/%(desc)s)' % {'desc': DESC()['accession']}
HSPS = lambda: HIT()['hsps'][0]
PCTALIGN = lambda: str(HSPS()['align_len']* 100 / QLEN()) + '%'
PCTID = lambda: str(HSPS()['identity'] * 100 / HSPS()['align_len']) + '%'
print('|'.join(['', 'Found in the assembly', 'Best hit on NCBI', 'Percent alignment', 'Percent identity', 'Accession', '']))
print('|'.join(['', ':--', ':--', ':--:', ':--:', ':--:', '']))
# go through every json file in the directory:
for j in glob.glob('*.json'):
hits = json.load(open(j))
# skip the poop file
if 'BlastOutput2' not in hits:
continue
# report the best hit:
index = 0
# unless the best hit resolves to a multispecies .. if it does, increment
# index
while 1:
if TITLE().find('MULTISPECIES') == -1:
break
index += 1
print('|'.join(['', QUERY(), SCINAME(), PCTALIGN(), PCTID(), ACC(), '']))
When it is run in a directory of JSON files downloaded form the NCBI, this program produces a markdown-formatted output table, so lazy people like myself can copy-paste it on their blogs:
Found in the assembly | Best hit on NCBI | Percent alignment | Percent identity | Accession |
---|---|---|---|---|
Ribosomal_S2 | Candidatus Taylorbacteria bacterium RIFCSPHIGHO2_01_FULL_46_22b | 96% | 58% | OHA18677 |
Ribosomal_L21p | Parcubacteria (Nomurabacteria) bacterium GW2011_GWD2_39_12 | 83% | 58% | KKR01970 |
Ribosomal_L2 | Parcubacteria (Nomurabacteria) bacterium GW2011_GWE2_36_115 | 100% | 79% | KKP91491 |
Ribosomal_L23 | Candidatus Zambryskibacteria bacterium RIFCSPHIGHO2_02_FULL_38_22 | 97% | 52% | OHA94501 |
Ribosomal_L4 | Candidatus Lloydbacteria bacterium RIFOXYC12_FULL_46_25 | 95% | 63% | OGZ17743 |
Ribosomal_S7 | Candidatus Zambryskibacteria bacterium RIFCSPHIGHO2_01_FULL_39_63 | 98% | 72% | OHA86974 |
Ribosomal_S20p | Parcubacteria bacterium GW2011_GWA2_47_12 | 97% | 61% | KKU59139 |
Ribosomal_L10 | Candidatus Zambryskibacteria bacterium RIFCSPHIGHO2_02_38_10.5 | 99% | 60% | OHA92782 |
Ribosomal_L20 | Candidatus Zambryskibacteria bacterium RIFCSPHIGHO2_01_FULL_43_27 | 96% | 64% | OHA89065 |
Ribosomal_L32p | Candidatus Lloydbacteria bacterium RIFCSPHIGHO2_01_FULL_41_20 | 95% | 44% | OGZ03679 |
Ribosomal_L9_N | Parcubacteria bacterium GW2011_GWA2_49_9 | 99% | 55% | KKW11623 |
Ribosomal_L27 | Parcubacteria (Nomurabacteria) bacterium GW2011_GWF1_34_20 | 96% | 71% | KKP60887 |
Ribosomal_S16 | Parcubacteria (Nomurabacteria) bacterium GW2011_GWF2_36_126 | 91% | 65% | KKP93401 |
Ribosomal_L13 | Candidatus Zambryskibacteria bacterium RIFCSPLOWO2_12_FULL_39_23 | 97% | 66% | OHB12744 |
Ribosomal_L17 | Candidatus Lloydbacteria bacterium RIFCSPHIGHO2_01_FULL_41_20 | 100% | 64% | OGZ03809 |
Ribosomal_L13 | Polaromonas naphthalenivorans CJ2 | 38% | 96% | A1VJJ3 |
Ribosomal_L14 | Rhodospirillales bacterium 69-11 | 81% | 36% | OJW27693 |
Ribosomal_S13 | Enterococcus faecalis TX2141 | 100% | 100% | EFT89376 |
This is a bit crazy.
Amino acid sequences of these ribosomal proteins have reasaonable hits in the NCBI, and most of these hits are most closely related to Candidate Phyla Radiation genomes released in Christopher T. Brown et al.’s 2015 paper and Karthik Anantharaman et al.’s recent publication. Another interesting thing is that except the last two lines in this table, all these genes are coming from contigs found in assemblies of pregnant samples.
Some speculations
Before making any suggestions, there are two points that we should take into consideration.
Is it a single population?
The first point is that all these contigs may be coming from a single microbial population.
In fact, in my opinion, it is very likely since there are only a small number of ribosomal proteins that occur only once.
To test this I did something very quick and dirty with anvi’o. You know, using some very basic characteristics of metagenomic data, we can get genomes from metagenomes rather rapidly. However, these methods work best when ‘coverage’ data is available (see this for why). But even with sequence composition alone, we can see the emergence of bins. Using only this contigs database, I created a blank profile with anvi’o:
anvi-profile -c Kowarsky_et_al.db \
-o PROFILE \
-S Kowarsky_et_al \
--blank
Then I created a simple text file to identify contigs with ribosomal protein hits, as well as contigs that originate from the assemblies of pregnant data sets. Here is the file, and one could download it into their work directory this way:
wget http://merenlab.org/files/Kowarsky_et_al_additional_data.txt
Then I run the anvi-interactive
on these:
anvi-interactive -c Kowarsky_et_al.db \
-p PROFILE/PROFILE.db \
--additional-layers Kowarsky_et_al_additional_data.txt
When I hit the draw button, this is what I had in my browser:
The ‘Ribosomal P’ layer marks contigs with ribosomal protein hits, and ‘PR origin’ layer indicates marks contigs from blood samples from pregnant women.
As you can see, all contigs with ribosomal protein hits are nicely together in a rather good looking cluster. But more importantly, there is a group of contigs that overlap with that section that are identified as ‘PR origin’ (the ones in closer brances are all from heart transplant, and lung transplant samples). So I made a very conservative selection here to minimize contamination:
This bin is only 20% complete according to the bacterial single-copy core genes from Campbell et al.:
It is about 160K in length (about 1/5th of an average CPR genome), but it is infinite times larger than zero, and it probably has very minimal contamination if any. So far so good!
So at this point I am convinced this is a single population, and this is the anvi’o-reported FASTA file for it if you would like to play with it more: CPR_Bin-from-Kowarsky-et-al.fa
Remember that this bin is recovered from very weak data, and it is very likely contains some noisy bits and pieces despite best efforts.
Is it in a single individual?
The second point is about its prevalence, and the quick answer is “we don’t know because Meren is lazy“.
Just like the fact that this seems to be a single population in the dataset, it may be occurring in a single individual. Becasue CPR genomes are often quite lonley in the sequence space, in our experience, they get assembled very nicely even in very complex samples, or even when they are not quite abundant (i.e., our recovery of CPRs form TARA Oceans metagenomes). The next very important step is to map all metagenomes to this bin to see whether all contigs in it recruits reads from multiple individuals. If you would like to do that, you can get all metagenomic short reads from ‘pregnant’ data sets, and map those to the contigs in this file.
If there is a single individual, it may be possible for someone to go back to that sample, and sequence it deeper to recover a good CPR genome bin.
Hypothesis: Oral cavity may be the actual reservoir for this CPR (Concluision: Not likely)
A very reasonable hypothesis came from Clifford Beall on Twitter after this post appeared online: “Blood bacteria sometimes come from the oral cavity and some CPR are there, could be one of those”.
In our -limited- experience in our lab, CPRs we find in the oral cavity consistently hits to TM7 (a.k.a Saccharibacteria) with rather high identities. However, sequence identities are too low for genes in this bin compared to what we had been seeing. That being said, probably if we understand its prevalence (which requires some mapping), we could have a better idea about its origin. The next thing to do is to map some oral metagenomes to this bin. If you do that, feel free to send me the BAM files or let me know about your findings!
Clifford Beall reports back: no hits from 20 saliva metagenomes
Clifford Beall reported on Twitter that his mapping of 20 saliva metagenomes with 68M paired-end reads didn’t yield any mapping that looked real to him.
Meren confirms: no hits from 14 plaque and tongue metagenomes
So, I also tried to recruit read using this bin from our plaque and tongue metagenomes that contains 587 million quality filtered reads in total. From these half a billion reads, the CPR bin recruited one read. Seriously. One as in one. Since it is only a single read, I will just put it there:
GCTCCGTCCATTTGAGCGGCACCAGTGATCATGTTTTTAACGTAGTCCGCGTGTCCTGGAGCGTCGATGTGAGCGTAGTGACG
The lucky contig is the 4,442 nt long PR_node_96. This contig does not really have a good match on the NCBI, yet the read from the metagenome matches 100% to multple Streptococcus isolates on the NCBI, including usual suspects of oral cavity like S. gordonii, S. mitis, S. sanguins. But the read does not match so perfectly to the contig (but Bowtie2 is so awesome, it managed to find it for us):
Score Expect Identities Gaps Strand
109 bits(120) 3e-28 74/83(89%) 0/83(0%) Plus/Plus
Query 1 GCTCCGTCCATTTGAGCGGCACCAGTGATCATGTTTTTAACGTAGTCCGCGTGTCCTGGA 60
|| |||||||| ||||||||||||||||||||||| || | ||||| ||||| ||||||
PR_96 4342 GCACCGTCCATCTGAGCGGCACCAGTGATCATGTTCTTGATATAGTCAGCGTGACCTGGA 4401
Query 61 GCGTCGATGTGAGCGTAGTGACG 83
|||||||||||||||||||| ||
PR_96 4402 GCGTCGATGTGAGCGTAGTGGCG 4424
I was so surprised by the fact that only one out of half a billion reads were recruited, I started to doubt that there may have been a mistake with the workflow.
So, to test the workflow, I downloaded a random isolate genome that I know occurs a lot in the oral cavity (which happened to be Streptococcus mitis B6 FN568063.1), and used the same exact workflow to map the same metagenomes to that genome, instead of the CPR bin. Instead of one, S. mitis recruited 687,973 reads. So, the workflow seems to be working well, but this CPR bin (1) is nowhere to be found in our oral metagenomes, and (2) does not carry any genes that have any reasonable resemblance to anything in our oral metagenomes.
I’m a bit more curious now.
Functions
Since this is a tiny, incomplete CPR bin with only 203 genes, it is clear that it would be very unlikely to be able to make any conclusive statements regarding its role. But I thought it still wouldn’t hurt to take a look at its functions (since it is very easy with anvi’o (shameless plugs all over)).
I first run NCBI COGs on it:
anvi-run-ncbi-cogs -c Kowarsky_CPR-contigs.db \
--num-threads 10
And created a summary (from which I could get all the gene sequences and functions for the CPR bin):
anvi-summarize -c Kowarsky_CPR-contigs.db \
-p PROFILE/PROFILE.db \
-C default \
-o SUMMARY
From that output, this is the most important file that contains all gene calls and COG functions in the CPR bin: CPR_Bin-from-Kowarsky-et-al-functions.txt
OK.
I myself gazed through the functions very quickly and recognized some that are common to many bacterial genomes, such as histidine kinases, tRNA synthetases, cell division proteins, ABC transporters, etc. Good that this thing really looks like a bacterial genome. One of the first thing that caught my eye because I didn’t recognize them quickly were transketolase genes.
Besides multiple genes for transketolase subunits, one of the contigs carried in a row genes for Transketolase, Pentose-5-phosphate-3-epimerase, Ribose 5-phosphate isomerase, and Glyceraldehyde-3-phosphate dehydrogenase. When my Google search for ‘transketolase’ returned these two papers in the first page, I started to think that someone might be playing a trick on me so I waste weeks of my life on something I know very little about:
OK. In all seriousness, transketolase turns out to be an “important enzyme in the breakdown of glucose through the pentose phosphate pathway”. For instance, in mammals, transketolase connects the pentose phosphate pathway to glycolysis. And the pentose phosphate pathway is a “metabolic pathway parallel to glycolysis” (here is a very nice introductory lecture on this).
The occurrence of these genes in such a lovely synteny suggests that this bin has some sort of pentose phosphate pathway, and it can use products of glycolysis towards the generation of nicotinamide adenine dinucleotide phosphate (NADPH). NADPH is an essential electron donor for anabolic reactions in all domains of life, so the pentose phosphate pathway is pretty common among microbes. To confirm that, I search for COG0036
(Pentose-5-phosphate-3-epimerase) in my collection of 4,233 gold-standard bacterial and archaeal genomes I got from Ensembl, and in fact 43% of all genomes had at least one gene that resolved to this COG. So yeah, not that interesting so far.
But in comparison, of all genomes, 28% have COG0698
(Ribose 5-phosphate isomerase), and 12% have COG3959
(Transketolase, N-terminal subunit). And only 9% of all genomes had all of them together. So even though the pentose phosphate pathway is very common, it seems the one that uses the enzyme transketolase to connect the pentose phosphate pathway to glycolysis is not that common. But is there something common to all those not-that-common geomes? The genomes that had the pentose phosphate pathway genes with transketolase included these guys: Atopobium parvulum, Bacillus licheniformis, Bacteroides vulgatus, Escherichia coli, Klebsiella pneumoniae, Lactobacillus gasseri, Listeria monocytogenes, Mycobacterium smegmatis, Salmonella enterica, Streptococcus pneumoniae, and Yersinia pestis. Most of these are either well-known human pathogens found in blood, or human associated microbes. In fact, the ones that belonged to phyla Proteobacteria, Firmicutes, and Bacteroidetes, made up 77% of all 9% of genomes that had this pathway similar to the one in this CPR bin.
So. Is it safe to say there is something up? Can we say this CPR has some metabolic capabilities that are similar to those pathogens that know how to find their ways in blood? There are number of studies that show the reduced metabolic capabilties and parasitic nature of some of the CPR genomes. So, if I had no shame, I would have insinuated that this CPR bin may suggest the existence of parasitic bacterial populations that can live in mamallian blood by relying on excess glucose in blood. As you know, even if this was true we may have not known about them as they could have been eluding cultivation and marker gene surveys due to their reduced metabolisms and quirky ribosomal proteins just like most other members of the CPR.
I think at this point it would be best to identify the patient these contigs assembled from, and do a much deeper seqeuncing of that sample if it is still available. Meanwhile, mapping publicly available blood metagenomes, or designing PCR primers based on this bin to serach for evidence for its occurrence in other individuals may be potential directions if there is any interest.
Can we say it though?
Regardless of all these, (1) assuming that nothing went wrong with the sampling, library preparation, and sequencing procedures, and (2) after taking into consideration that we may be talking about a single microbial population observed in a single pregnant woman, and (3) after reminding ourselves that this population may not even be active and simply just sweeping into the blood stream from the oral cavity of an individual, it still is probably safe to say for the first time “CPR in human blood!”.
There you have it ;)
New blog post. Investigating some of the microbial contigs assembled from human blood metagenomes by Kowarsky et al. https://t.co/9wd60xd1nh
— A. Murat Eren (@merenbey) August 24, 2017