Getting ribosomal RNA gene sequences from a MinION run

A. Murat Eren (Meren)

Creating an anvi’o contigs database
Taking a look at contig stats
Getting out ribosomal RNA gene sequences

We just started playing with our human gut fecal samples using MinION to get ourselves oriented with Oxford Nanopore’s little device.

Besides all the amazing things long-read sequencers offer, they also come with multiple challenges both at the bench side (such as the difficulties associated with recovering high-molecular weight and high-concentration DNA from samples) and the computational side (such as high rates of substitution errors as well as in/dels that were out of our lives thanks to Illumina). These are particularly serious issues if your focus is genome-resolved metagenomics. Nothing in life seems to just come only with good things. #SAD.

There is not much on our end to talk about yet. Yet with this post we will start simple, and maybe we will have a series of tutorials regarding how to play with MinION data in anvi’o in the long run hopefully with your suggestions and/or questions. Do you have a clear idea what anvi’o could do for you to make sense of your MinION data? Let us know! Also feel free to drop a line if you are curious about the run time, input DNA concentration, size selection, etc.

Questions? Concerns? Find us on

Basically, yesterday Andrea shared with me our very first MinION results from the sequencing of one of the donor gut fecal samples from this study:

Tracking microbial colonization in fecal microbiota transplantation experiments via genome-resolved metagenomics.

And here is a summary of what I did with it.

Creating an anvi’o contigs database

Creating a contigs database is a standard step anytime we have a bunch of reference sequences; whether these reference sequences represent contigs from an assembly, or belong to a single genome. Why should MinION output should be any different? But it is different because it comes as a FASTQ file and anvi’o works only with FASTA files.

So the first step is to convert my minion_D3.fastq (as Andrea named it) into a FASTA file using the illumina-utils library:

iu-fastq-to-fasta minion_D3.fastq \
                  -o minion_D3.fa

A quick note: running anvi-script-FASTA-to-contigs-db minion_D3.fa could generate an anvi’o contigs database directly from this FASTA file, but it is much better to be explicit for a learning opportunity, hence the steps below.

From this file we can create an anvi’o contigs database in multiple steps. First, we would like to simplify the deflines, so they do not contain crazy characters or long long names. This step somewhat standardizes things to decrease the likelihood of spontaneous future headaches:

anvi-script-reformat-fasta minion_D3.fa \
                           --simplify-names \
                           -o minion_D3_fixy.fa
                           
mv minion_D3_fixy.fa minion_D3.fa

Here one could remove very short contigs from the file using the flag --min-len, but I decided to keep everything for now.

Now we have the ‘proper’ FASTA file, we can generate an anvi’o contigs database from it (during which genes would be called using prodigal, and other basic stats would be computed):

anvi-gen-contigs-database -f minion_D3.fa \
                          -o minion_D3.db

Finally we can run the default set of HMMs (which include profiles for ribosomal RNAs) that distribute with anvi’o on the contigs database:

anvi-run-hmms -c minion_D3.db \
              --num-threads 10 

Our contigs database is ready.

Taking a look at contig stats

Since the v3 release, anvi’o has a program called anvi-display-contigs-stats that visualizes basic stats of ‘things’ in a given list of contigs databases, so we can take a quick look at this database by running:

anvi-display-contigs-stats minion_D3.db

This command will open an interactive display:

anvi-display-contigs-stats can also produce a text output, or compare multiple contigs databases (here is a screenshot for that).

Most contigs are pretty well sized. Andrea’s size selection seems to have worked well.

Usually the number of rRNA genes is a fraction of our assembled metagenomes. But in this case there are as many single-copy core genes as there are rRNA hits as shown under the ‘Raw number of HMM Hits’ section. This is due to the fact that HMMs to identify ribosomal RNAs use the DNA alphabet, while the HMMs that identify single-copy core genes rely on amino acid seqeunces. Because indels cause many frame-shift errors in the gene context, amino acid sequences of those crippled gene calls become unrecognizeable to any model. DNA models continue to rock because long hair don’t care.

One of our goals going forward with MinION is to get better at correcting those errors when short read data is available. But that’s for later.

Getting out ribosomal RNA gene sequences

Since we run HMMs for ribosomal RNAs, the contigs database knows which sequences had a significant match to any of the models in our ribosomal RNA collection. We can quickly learn what models available in any HMM collection this way:

anvi-get-sequences-for-hmm-hits -c minion_D3.db \
                                --hmm-source Ribosomal_RNAs \
                                --list-available-gene-names

* Archaeal_23S_rRNA, Archaeal_5S_rRNA, Archaeal_5_8S_rRNA,
  Bacterial_16S_rRNA, Bacterial_23S_rRNA, Bacterial_5S_rRNA,
  Eukaryotic_28S_rRNA, Eukaryotic_5S_rRNA, Eukaryotic_5_8S_rRNA,
  Mitochondrial_12S_rRNA, Mitochondrial_16S_rRNA, Archaeal_16S_rRNA

Here I want to take a quick look at 16S and 23S rRNA genes, but one could select anything from the list of available genes shown above (if you do not specify any gene names, you would get a FASTA file with everything):

anvi-get-sequences-for-hmm-hits -c minion_D3.db \
                                --hmm-source Ribosomal_RNAs \
                                --gene-name Bacterial_16S_rRNA,Bacterial_23S_rRNA \
                                -o minion_D3_rRNA.fa

This gives me a FASTA file with 42 sequences in it.

Naturally, I go to the NCBI with it, and BLAST them all against the nr database.

Here is the summary of top hits this script generated from the search results:

Found in the assembly	Best hit on NCBI	Percent alignment	Percent identity	Accession
Bacterial_23S_rRNA	Alistipes finegoldii	100%	88.17%	NR_103154
Bacterial_23S_rRNA	Anaerostipes hadrus	94.82%	81.49%	CP012098
Bacterial_16S_rRNA	Bacteroides cellulosilyticus	97.66%	91.43%	CP012801
Bacterial_23S_rRNA	Bacteroides cellulosilyticus	94.68%	87.26%	CP012801
Bacterial_23S_rRNA	Bacteroides cellulosilyticus	95.96%	88.25%	CP012801
Bacterial_23S_rRNA	Bacteroides ovatus	96.61%	87.82%	CP012938
Bacterial_16S_rRNA	Bacteroides uniformis	96.16%	87.35%	LT745889
Bacterial_23S_rRNA	Bacteroides uniformis	94.24%	86.61%	LT745890
Bacterial_23S_rRNA	Bacteroides uniformis	95.33%	86.93%	LT745890
Bacterial_23S_rRNA	Bacteroides uniformis	96.14%	84.60%	LT745890
Bacterial_23S_rRNA	Bacteroides uniformis	96.15%	89.04%	LT745890
Bacterial_23S_rRNA	Bacteroides vulgatus	94.09%	83.52%	CP000139
Bacterial_23S_rRNA	Bacteroides vulgatus	94.81%	86.89%	CP000139
Bacterial_23S_rRNA	Bifidobacterium longum	95.74%	83.44%	CP016019
Bacterial_23S_rRNA	Blautia hansenii	94.38%	80.77%	CP022413
Bacterial_16S_rRNA	Blautia luti	96.49%	85.06%	LC010692
Bacterial_23S_rRNA	Blautia sp.	94.10%	79.37%	CP015405
Bacterial_23S_rRNA	Burkholderiales bacterium	94.75%	82.78%	CP015403
Bacterial_23S_rRNA	Burkholderiales bacterium	97.68%	81.00%	CP015403
Bacterial_23S_rRNA	Clostridium sp.	95.22%	82.76%	FJ625862
Bacterial_16S_rRNA	Dorea longicatena	96.68%	85.33%	LT223662
Bacterial_16S_rRNA	Dorea longicatena	97.41%	88.00%	LC037228
Bacterial_16S_rRNA	Eubacterium rectale	95.79%	88.58%	CP001107
Bacterial_23S_rRNA	Eubacterium rectale	92.76%	80.10%	FP929042
Bacterial_23S_rRNA	Eubacterium rectale	94.77%	88.93%	FP929043
Bacterial_23S_rRNA	Eubacterium rectale	96.12%	82.80%	FP929043
Bacterial_23S_rRNA	Eubacterium sp.	94.85%	81.55%	LT635479
Bacterial_16S_rRNA	Faecalibacterium prausnitzii	93.91%	89.31%	CP022479
Bacterial_23S_rRNA	Faecalibacterium prausnitzii	94.37%	85.96%	CP022479
Bacterial_23S_rRNA	Faecalibacterium prausnitzii	95.16%	88.90%	CP022479
Bacterial_23S_rRNA	Faecalibacterium prausnitzii	96.76%	82.23%	FP929046
Bacterial_23S_rRNA	Faecalibacterium prausnitzii	98.97%	90.75%	FP929046
Bacterial_16S_rRNA	Lachnospiraceae bacterium	96.81%	85.00%	LN907763
Bacterial_23S_rRNA	Parabacteroides distasonis	94.52%	84.91%	NR_076385
Bacterial_23S_rRNA	Parabacteroides distasonis	94.91%	82.44%	CP000140
Bacterial_23S_rRNA	Ruminococcus torques	94.80%	84.14%	FP929055
Bacterial_16S_rRNA	uncultured bacterium	97.60%	89.13%	EF404264
Bacterial_16S_rRNA	uncultured bacterium	98.12%	90.72%	EF400825
Bacterial_23S_rRNA	uncultured bacterium	100.00%	86.63%	KX127935
Bacterial_23S_rRNA	uncultured bacterium	96.00%	77.53%	KX126288
Bacterial_23S_rRNA	uncultured bacterium	96.51%	80.03%	KX126288
Bacterial_23S_rRNA	uncultured bacterium	99.94%	81.09%	KX126288

Well.

Notbad

There are some hits that resolve to the same ‘species’ name and very likely represent different operons of the same population, and others that resolve to the same ‘species’ name but closely matching to different references. Clearly abundance plays a critical role here and the number of copies of rRNA genes coming from the same population should have a linear correlation with its abundance in the metagenome. It is funny to think about the fact that we used to get almost none of them with short assembly, now with long-read sequencing we will probably get way too many.

These will get much clearer when we are done with mapping, and start looking at them in the context of the population genomes we generated using the Illumina short reads from the same donor gut.

New blog post on how to take a quick look at #MinION contigs and recover ribosomal RNA sequences from a run.

We probably will SPAM you with little bits an pieces of information like this as we start experimenting with nanopore sequencing: https://t.co/KhZc0gPUbg #anvio
— A. Murat Eren (@merenbey) January 18, 2018

Getting ribosomal RNA gene sequences from a MinION run

Creating an anvi’o contigs database

Taking a look at contig stats

Getting out ribosomal RNA gene sequences

You might also enjoy (View all posts)