A tutorial on assembly-based metagenomics

The version number of this tutorial is 1.0, and for now it is tailored for Illumina paired-end shotgun sequencing with large inserts (i.e., no substantial overlap between the two reads in a given pair).

This is our very initial attempt to put together a comprehensive tutorial. If you would like to change something on this page, you can directly edit its source code by clicking the “Edit this file” icon in the top right corner of the page (you will need to be logged in to GitHub to see it), and send us a pull request through GitHub. We will be very thankful.

Assembly and mapping are key steps for most assembly-based, genome-resolved metagenomic studies, and there are many ways to accomplish each of these steps. That’s why the anvi’o metagenomic workflow only starts once you have your contigs and BAM files available.

But a metagenomic study starts much earlier than assembly and mapping. Experimental design, sampling and storage strategies, library preparation, and sequencing are all critical steps that should be given considerable thought, especially if the purpose is to recover metagenome-assembled genomes.

Despite the diversity of ways to prepare your libraries, perform your sequencing, and get your contigs and BAM files, we thought we could create a resource explaining the metagenomic workflow we and our colleagues have been using in our lab and at the MBL.

This tutorial will take you from notes on sampling and library preparation considerations for sequencing, to assembled contigs and BAM files, at which point you will be ready to follow the anvi’o metagenomic workflow, or use any other platform, to make sense of your data.

Before Sequencing

You have the question.

This section mostly deals with the field and wet-lab aspect of the workflow.

Sampling

Storage? Replicates?

Library preparation

DNA extraction? Size selection?

Sequencing

Which sequencing platform? How many samples per lane?

After Sequencing

You have your sequences.

This section mostly deals with the computational aspect of the workflow.

If you have some familiarity with the UNIX shell environment, you will have no trouble following the tutorial with your own data. We are working on a more automated workflow which should help people who have no familiarity with the UNIX shell environment, but we need a bit more time for that.

To make things very simple, we will assume you have two samples throughout this tutorial, Sample_01 and Sample_02.

Just to make sure we are mostly on the same page and you have all the software you need installed on your system, here are a couple of commands, and their outputs:

meren SSH://MBL ~ $ bowtie2 --version | head -n 1 | awk '{print $3}'
2.2.9
meren SSH://MBL ~ $ iu-filter-quality-minoche -v
Illumina-utils v1.4.6
meren SSH://MBL ~ $ samtools --version
samtools 1.3
Using htslib 1.3
meren SSH://MBL ~ $ megahit -v
MEGAHIT v1.0.6
meren SSH://MBL ~ $ anvi-init-bam -v
Anvio version ................................: 2.0.1
Profile DB version ...........................: 16
Contigs DB version ...........................: 6
Samples information DB version ...............: 2
Auxiliary HDF5 DB version ....................: 1
Users DB version (for anvi-server) ...........: 1

You may be lacking some of this software. You can install anvi’o using the methods described in this article, or you can take a look at the article that gives recipes for installing third-party software (what you are looking for may not be covered in that article if it is very simple to install, in which case Google will help you find out how to install it on your system).
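
If you are not sure which of these programs are already available on your system, a quick check from the shell may save you some time (a minimal sketch; it simply tests whether each of the commands above can be found in your PATH):

meren SSH://MBL ~ $ for prog in bowtie2 iu-filter-quality-minoche samtools megahit anvi-init-bam; do command -v $prog > /dev/null || echo "$prog is not in your PATH"; done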

Let’s start.

Demultiplexing

You have R1, R2, and I1 files for an entire Illumina HiSeq or MiSeq lane, and you need to demultiplex your samples using some barcodes.

The TAB-delimited barcode_to_sample.txt file should look like this:

Sample_01 TAATCGGATTCC
Sample_02 GCACACACGTTA

where each line has two columns to describe the sample name and its barcode.
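
Since the file is expected to be strictly TAB-delimited, a quick sanity check may save you some headache later (a minimal sketch; it prints any line that does not have exactly two TAB-separated columns, so no output means all is well):

meren SSH://MBL ~ $ awk -F'\t' 'NF != 2' barcode_to_sample.txt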

If everything is ready, these commands will demultiplex your samples:

meren SSH://MBL ~ $ mkdir 00_RAW/
meren SSH://MBL ~ $ iu-demultiplex -s barcode_to_sample.txt --r1 R1.fastq --r2 R2.fastq --index I1.fastq -o 00_RAW/
meren SSH://MBL ~ $ ls 00_RAW/
00_DEMULTIPLEXING_REPORT
Sample_01-R1.fastq     
Sample_01-R2.fastq
Sample_02-R1.fastq 
Sample_02-R2.fastq

It is always a good idea to take a quick look at the report file to make sure everything seems alright:

meren SSH://MBL ~ $ column -t 00_RAW/00_DEMULTIPLEXING_REPORT

Now you have everything you need and you can continue with quality filtering.

Quality Filtering

You have raw R1 and R2 files for Sample_01 and Sample_02, and you need to do quality filtering.

You first need to generate a TAB-delimited samples.txt file to point out where the raw R1 and R2 files for each sample are:

sample r1 r2
Sample_01 00_RAW/Sample_01-R1.fastq 00_RAW/Sample_01-R2.fastq
Sample_02 00_RAW/Sample_02-R1.fastq 00_RAW/Sample_02-R2.fastq
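
You can create this file manually, or, if your file names follow the pattern above, with a few lines of shell (a minimal sketch that assumes the 00_RAW/ layout from the demultiplexing step):

printf "sample\tr1\tr2\n" > samples.txt
for r1 in 00_RAW/*-R1.fastq
do
    # recover the sample name, and the matching R2 path, from the R1 file name
    sample=$(basename $r1 -R1.fastq)
    printf "%s\t%s\t%s\n" "$sample" "$r1" "${r1/-R1.fastq/-R2.fastq}" >> samples.txt
done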

Then you need to create a directory for the quality-filtered R1 and R2 files, and use the iu-gen-configs program with samples.txt to create config files for illumina-utils in it:

meren SSH://MBL ~ $ mkdir 01_QC
meren SSH://MBL ~ $ iu-gen-configs samples.txt -o 01_QC
meren SSH://MBL ~ $ ls 01_QC/
Sample_01.ini
Sample_02.ini

Now you are ready to run quality filtering for each of your samples. You can do it for one of them the following way:

meren SSH://MBL ~ $ iu-filter-quality-minoche 01_QC/Sample_01.ini

You should use iu-filter-quality-minoche only if you have large inserts. If you have partially overlapping reads, you should use the iu-merge-pairs program at this step instead.

Alternatively, you can do it for all your samples at once in a for loop, instead of doing it one by one:

meren SSH://MBL ~ $ for ini in 01_QC/*.ini; do iu-filter-quality-minoche $ini; done

Of course, if you have access to a cluster, you should add the necessary modifications to these commands.
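
For instance, on a cluster running SLURM (an assumption; your scheduler and its flags may differ), each sample could be submitted as a separate job:

meren SSH://MBL ~ $ for ini in 01_QC/*.ini; do sbatch --wrap "iu-filter-quality-minoche $ini"; done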

Once you are done, the contents of the 01_QC/ directory should look like this:

meren SSH://MBL ~ $ ls 01_QC/
Sample_01-QUALITY_PASSED_R1.fastq
Sample_01-QUALITY_PASSED_R2.fastq
Sample_01-READ_IDs.cPickle.z
Sample_01-STATS.txt
Sample_01.ini
Sample_02-QUALITY_PASSED_R1.fastq
Sample_02-QUALITY_PASSED_R2.fastq
Sample_02-READ_IDs.cPickle.z
Sample_02-STATS.txt
Sample_02.ini

You should definitely take a quick look at the *-STATS.txt files to see whether things went alright. A STATS file will look like this, and you can find more information about it here:

meren SSH://MBL ~ $ cat 01_QC/Sample_01-STATS.txt
number of pairs analyzed      : 122929
total pairs passed            : 109041 (%88.70 of all pairs)
  total pair_1 trimmed        : 6476 (%5.94 of all passed pairs)
  total pair_2 trimmed        : 9059 (%8.31 of all passed pairs)
total pairs failed            : 13888 (%11.30 of all pairs)
  pairs failed due to pair_1  : 815 (%5.87 of all failed pairs)
  pairs failed due to pair_2  : 12193 (%87.80 of all failed pairs)
  pairs failed due to both    : 880 (%6.34 of all failed pairs)
  FAILED_REASON_P             : 12223 (%88.01 of all failed pairs)
  FAILED_REASON_N             : 38 (%0.27 of all failed pairs)
  FAILED_REASON_C33           : 1627 (%11.72 of all failed pairs)

You can take a quick look at all samples by running commands like this:

meren SSH://MBL ~ $ grep 'total pairs passed' 01_QC/*STATS.txt

You can find more information about the options and ways to visualize the quality filtering results here.

If all looks good, you are ready for a co-assembly.

Co-assembly

You have quality-filtered R1 and R2 files for Sample_01 and Sample_02, and you want to get your contigs.fa.

Here we will use MEGAHIT for the assembly. If your lab has a different favorite, please help us expand this tutorial by adding yours. If you want to do it but don’t know how, send us an e-mail!

Let’s assume the quality-filtered FASTQ files for Sample_01 and Sample_02 are in a directory called 01_QC/, which looks like this:

meren SSH://MBL ~ $ ls 01_QC/*fastq
Sample_01-QUALITY_PASSED_R1.fastq
Sample_01-QUALITY_PASSED_R2.fastq
Sample_02-QUALITY_PASSED_R1.fastq
Sample_02-QUALITY_PASSED_R2.fastq

To make things simpler and to minimize human error, let’s create two shell variables:

meren SSH://MBL ~ $ R1s=`ls 01_QC/*QUALITY_PASSED_R1* | python -c 'import sys; print(",".join([x.strip() for x in sys.stdin.readlines()]))'`
meren SSH://MBL ~ $ R2s=`ls 01_QC/*QUALITY_PASSED_R2* | python -c 'import sys; print(",".join([x.strip() for x in sys.stdin.readlines()]))'`

They will look like this:

meren SSH://MBL ~ $ echo $R1s
01_QC/Sample_01-QUALITY_PASSED_R1.fastq,01_QC/Sample_02-QUALITY_PASSED_R1.fastq
meren SSH://MBL ~ $ echo $R2s
01_QC/Sample_01-QUALITY_PASSED_R2.fastq,01_QC/Sample_02-QUALITY_PASSED_R2.fastq

We are ready to run MEGAHIT:

meren SSH://MBL ~ $ megahit -1 $R1s -2 $R2s --min-contig-len $MIN_CONTIG_SIZE -m 0.85 -o 02_ASSEMBLY/ -t $NUM_THREADS

You should replace $MIN_CONTIG_SIZE and $NUM_THREADS with actual values you want. We usually use 1000 for $MIN_CONTIG_SIZE, and 40 for $NUM_THREADS (which of course depends on the number of CPUs available on a computer system).
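
With those values filled in, the command would look like this:

meren SSH://MBL ~ $ megahit -1 $R1s -2 $R2s --min-contig-len 1000 -m 0.85 -o 02_ASSEMBLY/ -t 40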

Once the co-assembly is done, we first check the log file to take a look at some of the simple stats, such as the size of the largest contig, the average contig length, and the N50, and to make sure things didn’t go south during the assembly:

meren SSH://MBL ~ $ tail 02_ASSEMBLY/log
    [assembler.cpp             : 188]     unitig graph size: 8343857, time for building: 359.859433
    [assembler.cpp             : 206]     Number of bubbles/complex bubbles removed: 854393/135891, Time elapsed(sec): 11.599862
    [assembler.cpp             : 230]     Unitigs removed in excessive pruning: 218649, time: 2.998776
    [assembler.cpp             : 279]     Number of local low depth unitigs removed: 31409, complex bubbles removed: 44857, time: 142.707443
    [assembler.cpp             : 132]     Total length: 3189757381, N50: 1335, Mean: 547, number of contigs: 5824928
    [assembler.cpp             : 133]     Maximum length: 272005
    [utils.h                   : 126]     Real: 1237.1572	user: 65588.0131	sys: 40.3599	maxrss: 18847612
--- [Tue Jun 21 09:50:48 2016] Merging to output final contigs ---
--- [STAT] 787465 contigs, total 1854094345 bp, min 1000 bp, max 272005 bp, avg 2355 bp, N50 2640 bp
--- [Tue Jun 21 09:51:19 2016] ALL DONE. Time elapsed: 36922.054002 seconds ---

Then we finalize our contigs.fa, simplifying the contig names in it and eliminating some of the shorter contigs at the same time if necessary:

meren SSH://MBL ~ $ mkdir 03_CONTIGS
meren SSH://MBL ~ $ anvi-script-reformat-fasta 02_ASSEMBLY/final.contigs.fa -o 03_CONTIGS/contigs.fa --min-len 2500 --simplify-names --report name_conversions.txt
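
If you are curious how many contigs survived the length filter, counting the FASTA headers is a quick way to find out (the number will of course depend on your data):

meren SSH://MBL ~ $ grep -c '>' 03_CONTIGS/contigs.fa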

Once this is done, we have our contigs.fa under the directory 03_CONTIGS/. Time to map things!

Mapping

You have the quality-filtered R1 and R2 files for Sample_01 and Sample_02, as well as your contigs.fa, and you want to get BAM files for your samples.

Here we will use Bowtie2 for mapping. If your lab has a different favorite, please help us expand this tutorial by adding yours.

Let’s first create a new directory for mapping results:

meren SSH://MBL ~ $ mkdir 04_MAPPING

Next, we need to build an index for our contigs:

meren SSH://MBL ~ $ bowtie2-build 03_CONTIGS/contigs.fa 04_MAPPING/contigs

When this is done, this is how you will get an indexed BAM file for one of your samples:

meren SSH://MBL ~ $ bowtie2 --threads $NUM_THREADS -x 04_MAPPING/contigs -1 01_QC/Sample_01-QUALITY_PASSED_R1.fastq -2 01_QC/Sample_01-QUALITY_PASSED_R2.fastq -S 04_MAPPING/Sample_01.sam
meren SSH://MBL ~ $ samtools view -F 4 -bS 04_MAPPING/Sample_01.sam > 04_MAPPING/Sample_01-RAW.bam
meren SSH://MBL ~ $ anvi-init-bam 04_MAPPING/Sample_01-RAW.bam -o 04_MAPPING/Sample_01.bam
meren SSH://MBL ~ $ rm 04_MAPPING/Sample_01.sam 04_MAPPING/Sample_01-RAW.bam

You should replace $NUM_THREADS with the actual integer value you prefer. We usually use 8 when we run a lot of things in parallel (which of course depends on the number of CPUs available on your system).

Clearly, you will need to do this for all your samples. The snippet below shows one way to utilize your samples.txt file for that (see the section Quality Filtering if you do not have one); it is a minimal sketch that assumes your quality-filtered files follow the naming pattern used throughout this tutorial:
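
# assuming the first column of samples.txt lists your sample names after a header line
for sample in `awk 'NR > 1 {print $1}' samples.txt`
do
    bowtie2 --threads $NUM_THREADS -x 04_MAPPING/contigs \
            -1 01_QC/$sample-QUALITY_PASSED_R1.fastq \
            -2 01_QC/$sample-QUALITY_PASSED_R2.fastq \
            -S 04_MAPPING/$sample.sam
    samtools view -F 4 -bS 04_MAPPING/$sample.sam > 04_MAPPING/$sample-RAW.bam
    anvi-init-bam 04_MAPPING/$sample-RAW.bam -o 04_MAPPING/$sample.bam
    rm 04_MAPPING/$sample.sam 04_MAPPING/$sample-RAW.bam
done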

Did it work? Well, we certainly hope it did. Do not forget to make sure you have all your BAM files in the mapping directory, and that they have reasonable sizes:

meren SSH://MBL ~ $ ls -lh 04_MAPPING/
-rw-rw-r-- 1 meren merenlab 476M Jul 14 10:25 Sample_01.bam
-rw-rw-r-- 1 meren merenlab 1.9M Jul 14 10:26 Sample_01.bam.bai
-rw-rw-r-- 1 meren merenlab 772M Jul 14 10:29 Sample_02.bam
-rw-rw-r-- 1 meren merenlab 1.9M Jul 14 10:30 Sample_02.bam.bai

At this point the size of each BAM file will be roughly proportional to the number of reads from your metagenomes that mapped to your contigs, which is an indication of how well your contigs represent the metagenome.
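
If you prefer actual numbers over file sizes, one way to count the reads in each BAM file is through samtools (a minimal sketch using samtools idxstats, the third column of which reports the number of mapped reads per contig):

meren SSH://MBL ~ $ for bam in 04_MAPPING/*.bam; do echo -n "$bam: "; samtools idxstats $bam | awk '{mapped += $3} END {print mapped" mapped reads"}'; done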

If you are here, it means you have your contigs.fa, and your BAM files! Congratulations! :)

What is next?

Now you can continue with the anvi’o metagenomic workflow, or you can take a look at all the other anvi’o stuff using the surprisingly fancy menu on this page.

Please let us know if something doesn’t work.