Minimum Entropy Decomposition: A new clustering algorithm

A. Murat Eren (Meren)

The manuscript
In a nutshell
Installing
Running MED
Getting a sense of MED
An application

The manuscript

Our new information theory-based clustering algorithm, “Minimum Entropy Decomposition”, is in press to appear in ISMEJ. You can get a copy of the advance online print here.

In a nutshell

MED clusters 16S rRNA gene amplicons (and other marker genes) in a sensitive manner.

The power of the algorithm comes from the principle of entropy minimization to achieve ecologically relevant units in high-throughput sequencing datasets of marker genes, which was introduced by oligotyping in 2013.

In a nutshell, MED

Does not perform pairwise sequence comparison,
Does not rely on arbitrary sequence similarity thresholds,
Does not require user supervision,
Does not require preliminary classification or clustering results,
Is agnostic to sampling strategy or how long your sequences are,
Gives you 1 nucleotide resolution over any sequencing length with computational efficiency and minimal computational heuristics.

The algorithm is simple and intuitive. Here is a paragraph from the manuscript:

[MED] iteratively partitions a dataset of amplicon sequences into homogenous OTUs (‘MED nodes’) that serve as input to alpha- and beta-diversity analyses. MED inherits the core principle of oligotyping (Eren et al., 2013a) and uses Shannon entropy to identify information-rich positions within an internal node. Entropy increases proportionally to the amount of variability in a nucleotide position and MED uses high entropy positions to decompose a node into child nodes. A nucleotide position that directs a decomposition step will have zero entropy in child nodes. Hence, the increasing number of identified nodes decreases the cumulative entropy in the dataset.

Installing

MED has always been a part of the oligotyping pipeline. So, if you install the oligotyping pipeline, you have access to MED as well. This post explains the installation:

https:merenlab.org/2012/05/11/oligotyping-pipeline-explained/

And if you would like to run it in a VirtualBox, that is also an option:

https:merenlab.org/2014/09/02/virtualbox/

I would be very interested in hearing your experiences. We have a discussion forum here, you can use to ask generic questions. If you are having hard time finding your way out, please send me an e-mail and I will try to help you.

Running MED

Once the oligotyping pipeline is installed, running MED is as easy as typing this command in the terminal window, which should give you a static HTML output directory to see your results (you will need to copy-paste the address given in the terminal into your browser window, or open the directory and double-click the index.html file):

decompose your_fasta.fa

Optionally you can provide a sample mapping file, which will generate some basic figures for visual survey of groups:

decompose your_fasta.fa -E sample_mapping.txt

The expected format of the FASTA file is explained here, and you can find the recipe to generate a sample mapping file here.

There is also a help menu with a list of available command line parameters:

decompose -h

Similar to oligotyping, MED expects the read length variation in a given FASTA file (if there is any) to be biologically meaningful. Please read this article to determine which case fits your situation better:

https:merenlab.org/2014/09/14/oligotyping-and-alignment/

Getting a sense of MED

If you would like to give MED a quick try with properly formatted files, you can use this example FASTA file, and this example sample mapping file. In fact the FASTA file is a subsampled version of the sponge dataset we used in the MED manuscript. If you run MED on this dataset with this command:

decompose sponge-1K.fa -E sponge-sample-mapping.txt

The run should end with an HTML output, a copy of which is accessible here:

http://oligotyping.org/MED/files/sponge-html-output/

And here is another example MED output for a human microbiome analysis:

http://oligotyping.org/MED/analyses/HMP-MED-Example/

An application

In the manuscript we analyzed two datasets with MED to compare its results with taxonomical analysis and 97% OTU clustering.

One of these datasets we used was the sponge dataset (the other one was the re-analysis of the oral dataset I also mentioned here recently). Hexadella dedritifera and Hexadella cf. dedritifera are two deep-sea sponge species that are morphologically indistinguishable and can occur sympatrically. They can only be distinguished through genetic surveys, by sequencing and comparing key proteins they possess.

Whether the microbiomes of these two recently differentiated sponge species differ is an important question of ecology, especially with respect to the critical symbiosis between sponges and their microbial residents. The figure below (Figure 3 in the MED paper), demonstrates how does each method answer this question if you sequence the V4V5 region of the 16S rRNA gene to understand the microbial community structure:

While the network analysis based on SILVA-based taxa and 3% OTUs do not show a difference between the microbiomes of two cryptic sponge species, they are far apart from each other in the network based on MED nodes. The reason is because the MED nodes that differentially occur in these hosts are more than 99% identical:

node_000000703      ACGGAGGGTGCAAGCGTTAATCGGAATCACTGGGCGTAAAGCGCACGCAGGCGGTTTGTT
node_000000166      ACGGAGGGTGCAAGCGTTAATCGGAATCACTGGGCGTAAAGCGCACGCAGGCGGTTTGTT
                    ************************************************************

node_000000703      AAGTCAGATGTGAAAGCCCCGGGCTCAACCTGGGAACTGCGTTTGATACTGACGAACTAG
node_000000166      AAGTCAGATGTGAAAGCCCCGGGCTCAACCTGGGAACTGCATTTGATACTGGCGAACTAG
                    **************************************** ********** ********

node_000000703      AGTATGGTAGAGAGAAGTGGAATTCCACATGTAGCGGTGAAATGCGTAGAGATGTGGAGG
node_000000166      AGTATGGTAGAGAGAAGTGGAATTCCACATGTAGCGGTGAAATGCGTAGATATGTGGAGG
                    ************************************************** *********

node_000000703      AACATCAGTGGCGAAGGCGACTTCTTGGACCAATACTGACGCTCAGGTGCGAAAGCGTGG
node_000000166      AACATCAGTGGCGAAGGCGACTTCTTGGACCAATACTGACGCTCAGGTGCGAAAGCGTGG
                    ************************************************************

node_000000703      GGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCTACCAGCCGTT
node_000000166      GGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCTACCAGCCGTT
                    ************************************************************

node_000000703      GGGGGACTTGTTCCCTTAGTGGCGAAGCTAACGCGATAAGTAGACCGCCTGGGGAGTACG
node_000000166      GGGGGACTTGTTCCCTTAGTGGCGAAGCTAACGCGATAAGTAGACCGCCTGGGGAGTACG
                    ************************************************************

node_000000703      GCCGCAAGGCTAAA
node_000000166      GCCGCAAGGCTAAA
                    **************

There are clear limitations associated with the 16S rRNA gene, but there is a great deal of information that can be missed through algorithms that are not optimal.