A tutorial on assembly-based metagenomics
The version number of this tutorial is
1.0, and for now it is tailored for Illumina paired-end shotgun sequencing with large inserts (i.e., no substantial overlap between two reads in a given pair).
This is our very initial attempt to put together a comprehensive tutorial. If you would like to change something on this page, you can directly edit its source code by clicking the “Edit this file” icon on the right top corner of the page, and send us a pull request through GitHub. We will be very thankful. You will need to login to GitHub to see that icon.
Assembly and mapping are key steps for most assembly-based, genome-resolved metagenomic studies, and there are many ways to accomplish each of these steps. That’s why, the anvi’o metagenomic workflow only starts once you have your contigs and BAM files available.
But a metagenomic study starts much earlier than assembly and mapping. Experimental design, sampling and storage strategies, library preparation, and sequencing are all critical steps that should be given considerable thought, especially if the purpose is to recover metagenome-assembled genomes.
Despite the diversity of ways to get your contigs and BAM files, or preparing your libraries and perform your sequencing, we thought we could create a resource to explain the metagenomic workflow we and our colleagues have been using in our lab and at the MBL.
This tutorial will take you from notes on sampling and library preparation considerations for sequencing, to assembled contigs and BAM files, at which point you will be ready to follow the anvi’o metagenomic workflow, or any other platform to make sense of your data.
You have the question.
This section mostly deals with the field and wet-lab aspect of the workflow.
DNA extraction? Size selection?
Which sequencing platform? How many samples per lane?
You have your sequences.
This section mostly deals with the computational aspect of the workflow.
If you have some familiarity with the UNIX shell environment, you will have no trouble following the tutorial with your own data. We are working on a more automated workflow which should help people who have no familiarity with the UNIX shell environment, but we need a bit more time for that.
To make things very simple, we will assume you have two samples throughout this tutorial,
Just to make sure we are mostly on the same page and you have all the software you need installed on your system, here are a couple of commands, and their outputs:
You may be lacking some of these software. You can install anvi’o using methods described in this article, or you can take a look at the article that gives recipes to install third-party software (what you are looking for may not be covered in that article if it is very simple to install, in which case Google will help you find out how to install those on your system).
I1 files for an entire Illumina HiSeq or MiSeq lane, and you need to demultiplex your samples using some barcodes.
barcodes_to_samples.txt file should look like this:
where each line has two columns to describe the sample name and its barcode.
If everything is ready, these commands will demultiplex your samples:
It is always a good idea to take a quick look at the report file to make sure everything seems alright:
Now you have everything you need and you can continue with quality filtering.
You have raw
R2 files for
Sample_02, and you need to do quality filtering.
You first need to generate a TAB-delimited
samples.txt file to point out where are your raw
R2 files for each sample:
Then you need create a directory for quality-filtered
R2, and then use
iu-gen-configs program with
samples.txt to crate config files for illumina-utils in it:
Now you are ready to run quality filtering for each of your samples. You can do it for one of them the following way:
You should use
iu-filter-quality-minoche only if you have large inserts. If you have partially overlapping reads, you should use
iu-merge-pairs program at this step.
Alternatively you can do it for all your samples at once in a
for loop, instead of doing it one by one:
Of course, if you have access to a cluster, you should add necessary modifications to these commands.
Once you are done, the contents of the
01_QC/ directory should look like this:
You should definitely take a quick look at
*_STATS.txt files to see whether things went alright. A STATS file will look like this, and you can find more information about it here:
You can take a quick look at all samples by running commands like this:
You can find more information about the options, and ways to visualize the quality filtering results here.
If all looks good, you are ready for a co-assembly.
You have quality-filtered
R2 files for
Sample_02, and you want to get your
Here we will use MEGAHIT for the assembly, if your lab has a different favorite, please help us expand this tutorial by adding yours. If you want to do it but don’t know how, send us an e-mail!
Let’s assume quality-filtered FASTA files for
Sample_02 are in a directory called
01_QC/, which looks like this:
To make things simpler and to minimize human error, let’s create two environment variables:
They will look like this:
We are ready to run MEGAHIT.
You should replace
$NUM_THREADS with actual values you want. We usually use 1000 for
$MIN_CONTIG_SIZE, and 40 for
$NUM_THREADS (which of course depends on the number of CPUs available on a computer system).
Once the co-assembly is done, we first check the log file to take a look at some of the simple stats such as the size of the largest contigs, average contig length, N50, and to make sure things didn’t go south during the assembly:
Then we finalize our
contigs.fa while simplifying contig names in it, and eliminating some of the short contigs if at the same time if necessary:
Once this is done, we have our
contigs.fa under the directory
03_CONTIGS/. Time to map things!
You have the quality-filtered
R2 files for
Sample_02, as well as your
contigs.fa, and you want to get BAM files for your samples.
Here we will use Bowtie2 for mapping, if your lab has a different favorite, please help us expand this tutorial by adding yours.
Let’s first create a new directory for mapping results:
Next, we need to build an index for our contigs:
When this is done, this is how you will get an indexed BAM file for one of your samples:
You should replace
$NUM_THREADS with an actual integer values you prefer. We usually use 8 for
$NUM_THREADS when we run a lot of things in parallel (which of course depends on the number of CPUs available on a computer system).
Clearly you will need to do for all your samples. You can take a look at this snippet to see how you can utilize your
samples.txt file (see the section Quality Filtering if you do not have one):
Did it work? Well, we certainly hope it did. Do not forget to make sure you have all your BAM files in the mapping directory, and they have reasonable sizes:
At this point the size of each BAM file will be proportional to the number of reads from your metagenomes mapped to your contigs, which is an indication of how well your contigs represent the metagenome.
If you are here, it means you have your
contigs.fa, and your BAM files! Congratulations! :)
What is next?
Now you can continue with the anvi’o metagenomic workflow, or you can take a look at all the other anvi’o stuff using this surprisingly fancy menu:
Please let us know if something doesn’t work.