Microbial 'omics

This article describes some recipes to install software written by other groups.

Most of the software we rely on to extend anvi’o’s abilities does allow us to redistribute its code or ship it pre-installed. However, we do not want to follow that route. Although doing so would have made your life much easier, absorbing third-party software into other platforms makes it harder for users to appreciate other groups’ efforts.

As an apology, we will do our best to keep this article up-to-date, so installing third-party software anvi’o uses will not be a big hassle for you. Thank you for your understanding, and your patience in advance.

We make a lot of typos, parameters or versions sometimes change slightly, and we do not always manage to keep tutorials up-to-date. If you find a mistake on this page, or if you would like to change something in it, you can directly edit its source code by clicking the “Edit this file” icon in the top right corner (which you will see if you have logged in to GitHub), and send us a ‘pull request’. We will be very thankful.

samtools

samtools is a high-performance program to manipulate SAM and BAM files.

Citation: https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btp352

Go to your terminal and type samtools --version. If you get an error, you need to install it. If the version number is smaller than 1.3.1, you probably need to update it.

You can update samtools and/or install it on your system the following way:

wget https://github.com/samtools/samtools/releases/download/1.3.1/samtools-1.3.1.tar.bz2
tar -jxvf samtools-1.3.1.tar.bz2 && cd samtools-1.3.1
make && sudo make install

Don’t forget to type samtools --version again to confirm that it is all good!
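If you would rather not compare version numbers by eye, the check above can be scripted. This is only a minimal sketch: version_at_least is a hypothetical helper name, and it assumes your sort supports the -V (version sort) flag, which GNU coreutils provides.

```shell
# Succeeds when INSTALLED is the same as or newer than REQUIRED.
# usage: version_at_least INSTALLED REQUIRED
version_at_least() {
    # sort -V puts the older version first, so REQUIRED must come out on top
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

required="1.3.1"
if command -v samtools >/dev/null 2>&1; then
    # the first line of the output looks like "samtools 1.3.1"
    installed="$(samtools --version | head -n 1 | awk '{print $2}')"
    if version_at_least "$installed" "$required"; then
        echo "samtools $installed is recent enough"
    else
        echo "samtools $installed is older than $required, consider updating"
    fi
else
    echo "samtools is not installed"
fi
```

The same helper should work for any tool that reports a plain dotted version number.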

Prodigal

Prodigal is a bacterial and archaeal gene finding program developed at Oak Ridge National Laboratory and the University of Tennessee. Every time you create a contigs database in anvi’o with anvi-gen-contigs-database, you use it.

Citation: http://www.biomedcentral.com/1471-2105/11/119

Go to your terminal and type prodigal -v. If you get an error, you need to install it. If the version number is smaller than 2.6.2, you need to update it.

Here is how to install v2.6.2 (the first line will not work if you don’t have wget, but you can install wget easily by typing sudo port install wget if you are using the MacPorts system on your Mac):

wget https://github.com/hyattpd/Prodigal/archive/v2.6.2.tar.gz
tar -zxvf v2.6.2.tar.gz && cd Prodigal-2.6.2/ && make
sudo cp prodigal /usr/local/bin/

Type prodigal -v again to make sure everything is alright, and you get the proper version number.
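If wget is missing and you would rather not install it, curl can usually stand in for it. The sketch below is one way to fall back automatically; pick_downloader and fetch are hypothetical helper names, not part of any of the tools described here.

```shell
# Print the name of the first available download tool, or "none".
pick_downloader() {
    if command -v wget >/dev/null 2>&1; then
        echo "wget"
    elif command -v curl >/dev/null 2>&1; then
        echo "curl"
    else
        echo "none"
    fi
}

# Download URL into the current directory, keeping the remote file name.
fetch() {
    case "$(pick_downloader)" in
        wget) wget "$1" ;;
        curl) curl -L -O "$1" ;;   # -L follows redirects, -O keeps the remote name
        *)    echo "Neither wget nor curl was found" >&2; return 1 ;;
    esac
}

# Example (commented out so that nothing is downloaded by accident):
# fetch https://github.com/hyattpd/Prodigal/archive/v2.6.2.tar.gz
```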

HMMER

HMMER uses hidden Markov models to perform sequence searches and alignments. Every time you run the anvi-run-hmms program, you use it.

Citation: http://hmmer.org/

Go to your terminal and type hmmscan -h. If you get an error, you need to install HMMER. If the version number is less than 3.1, you need to update it.

Here is how to install v3.1b2:

wget http://eddylab.org/software/hmmer3/3.1b2/hmmer-3.1b2.tar.gz
tar -zxvf hmmer-3.1b2.tar.gz
cd hmmer-3.1b2
./configure && make && sudo make install
cd easel && make check && sudo make install

Type hmmscan -h again to make sure everything is alright, and you get the proper version number.

SQLite

SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. Anvi’o uses SQLite pretty much all the time.

Citation: https://www.sqlite.org/

Go to your terminal and type sqlite3 --version. If you get an error, you need to install it. Extensive installation instructions are here. Or you can install it by typing sudo port install sqlite3 if you are using the port system on a Mac OSX computer.

Note: Although this is completely optional, you may also want to consider installing DB Browser for SQLite. It is a lightweight, open-source database browser with a nice graphical interface, and it is very easy to install. You probably will never need it or use it, but it may be handy at some point.

GNU Scientific Library

GSL is a widely used C library for scientific computation. The only thing that depends on GSL is the CONCOCT extension in the codebase. The installation is quite straightforward on most systems. If you are using MacPorts, you can type this in your terminal: port install gsl gsl-devel py27-gsl (Rika tells me Homebrew on Mac works, too). Otherwise, try these commands and you should be OK:

wget ftp://ftp.gnu.org/gnu/gsl/gsl-latest.tar.gz
tar -zxvf gsl-latest.tar.gz
cd gsl-*
./configure && make && sudo make install
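To confirm that the library ended up somewhere your system can find it, you can ask gsl-config, the helper script that GSL installs alongside the library. This is a small sketch; gsl_status is a hypothetical name used only for this example.

```shell
# Report the installed GSL version, or say that GSL could not be found.
gsl_status() {
    if command -v gsl-config >/dev/null 2>&1; then
        echo "GSL $(gsl-config --version)"
    else
        echo "GSL not found"
    fi
}

gsl_status
```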

NumPy

NumPy is the fundamental package for scientific computing with Python. Anvi’o uses numpy quite often, and probably not in the best way possible.

Citation: https://arxiv.org/abs/1102.1523

You don’t need to install numpy if you get no complaints back when you type python -c "import numpy" in your terminal. If you do get an import error, then you need to install numpy. You can try this:

sudo pip install numpy

Cython

Cython is “an optimising static compiler for both the Python programming language and the extended Cython programming language”. If python -c "import Cython" in your terminal does not complain, you are golden. Otherwise, install it by running this:

sudo pip install Cython
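The two import checks for numpy and Cython can be combined into one loop. In this sketch have_py_module is a hypothetical helper, and the PYTHON variable is an assumption that lets you point at a different interpreter (for example PYTHON=python3) if plain python is not the one you use with anvi’o.

```shell
# Interpreter to test; override with e.g. PYTHON=python3 if needed.
PYTHON="${PYTHON:-python}"

# Succeeds when the given module can be imported by $PYTHON.
have_py_module() {
    "$PYTHON" -c "import $1" >/dev/null 2>&1
}

for module in numpy Cython; do
    if have_py_module "$module"; then
        echo "$module: OK"
    else
        echo "$module: missing (try: sudo pip install $module)"
    fi
done
```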

FastTree

FastTree infers approximately-maximum-likelihood phylogenetic trees from alignments of nucleotide or protein sequences. To install FastTree, first visit this page and follow the instructions to compile it:

http://www.microbesonline.org/fasttree/#Install

If you feel lazy, you can try these commands for a quick installation, too:

wget http://www.microbesonline.org/fasttree/FastTree.c
gcc -DNO_SSE -O3 -finline-functions -funroll-loops -Wall -o FastTree FastTree.c -lm

Regardless of how you compiled it, run this command to make sure it is in your PATH:

sudo mv FastTree /usr/local/bin

If everything is OK, this is the output you should see when you run FastTree on your system:

$ FastTree
Usage for FastTree version 2.1.10 No SSE3:
  FastTree protein_alignment > tree
  FastTree < protein_alignment > tree
  FastTree -out tree protein_alignment
  FastTree -nt nucleotide_alignment > tree
  FastTree -nt -gtr < nucleotide_alignment > tree
  FastTree < nucleotide_alignment > tree
FastTree accepts alignments in fasta or phylip interleaved formats

(...)

HDF5

HDF5 is “a data model, library, and file format for storing and managing data”. If you are not sure what it is, you probably don’t have it, but we are big fans of HDF5 here on the anvi’o development side. If you are using MacPorts on your Mac, you can get away with sudo port install hdf5. Otherwise you can run these commands in your terminal (these are for version 1.8.17; feel free to check whether there is a newer release of HDF5 from here, and install the most current tar.bz2 file in that directory instead):

wget https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.8/hdf5-1.8.17/src/hdf5-1.8.17.tar.bz2
tar -jxvf hdf5-1.8.17.tar.bz2
cd hdf5-1.8.17
./configure && make && sudo make install

Depending on your operating system and version, you may need to install the libhdf5-dev package separately to avoid fatal No such file or directory errors for various header files (we heard complaints from Debian and Ubuntu users).

Centrifuge

Centrifuge is a “classification engine that enables rapid, accurate and sensitive labeling of reads and quantification of species on desktop computers”.

Citation: http://biorxiv.org/content/early/2016/05/25/054965

To install centrifuge, you first need to decide where you want to put all its files on your disk. It could be a directory under /opt, or /usr/local, or somewhere under your user directory in case you don’t have superuser access on the machine you are working on. Once you know where, open a terminal and set an environment variable that points to the base directory where you want to keep all centrifuge files:

$ export CENTRIFUGE_BASE="/path/to/a/directory"

Do not forget to make sure your version of /path/to/a/directory is a full path, and starts with a / character.

More on the “full path” thingy: Let’s say I want to put all centrifuge related stuff in a directory called CENTRIFUGE in my home. Here is what I do: First, in my terminal I type cd to make sure I am in my home directory. Then I type mkdir -p CENTRIFUGE to make sure the directory CENTRIFUGE exists in my home. Then I type cd CENTRIFUGE to go into it. Finally I type pwd to get the full path, and replace /path/to/a/directory in the command above with that entire string (still keeping it in double quotes) before running the export command.
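The same cd, mkdir, and pwd steps can be rolled into a single helper so that the full path is never typed by hand. This is a sketch only, and abs_dir is a hypothetical name.

```shell
# Create DIR if needed and print its full (absolute) path.
# The subshell keeps your current working directory unchanged.
abs_dir() {
    mkdir -p "$1" && (cd "$1" && pwd)
}

# Example for a CENTRIFUGE directory in your home (uncomment to use):
# export CENTRIFUGE_BASE="$(abs_dir "$HOME/CENTRIFUGE")"
# echo "$CENTRIFUGE_BASE"
```

Whatever it prints should start with a / character, which is exactly what the export command needs.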

Then you will get the code, and compile it:

cd $CENTRIFUGE_BASE
git clone https://github.com/infphilo/centrifuge
cd centrifuge
git checkout 30e3f06ec35bc83e430b49a052f551a1e3edef42
make

This compiles everything, but does not install anything. To make sure binary files are available directly, you can run this:

$ export PATH=$PATH:$CENTRIFUGE_BASE/centrifuge

If everything is alright so far, this is what you should see if you run the following command:

$ centrifuge --version | head -n 1
centrifuge-class version v1.0.1-beta-27-g30e3f06ec3

Good? Good. If it does not work, it means you made a mistake with your path variables. If it worked, you are golden, and now you should add these two lines to your ~/.bashrc or ~/.bash_profile file (whichever one is being used on your system, most likely ~/.bash_profile will work) to make sure they are set in your environment every time you start a new terminal (clearly with the right full path):

export CENTRIFUGE_BASE="/path/to/a/directory"
export PATH=$PATH:$CENTRIFUGE_BASE/centrifuge

You can test whether you managed to do this right by opening a new terminal, and typing centrifuge --version. Did it work? Good. Then you set your environment variables right.
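If you prefer to add those lines from the terminal rather than opening an editor, a small helper can append each one only when it is not already there, so running it twice does not leave duplicates. This is a sketch; append_once is a hypothetical name, and ~/.bash_profile is assumed to be the file your shell reads.

```shell
# Append LINE to FILE unless an identical line is already present.
# usage: append_once FILE LINE
append_once() {
    touch "$1"
    # -x matches whole lines only, -F treats LINE as a literal string
    grep -qxF -- "$2" "$1" || printf '%s\n' "$2" >> "$1"
}

# Example (uncomment and fix the path before use). The single quotes matter:
# they keep $PATH and $CENTRIFUGE_BASE from being expanded too early.
# append_once ~/.bash_profile 'export CENTRIFUGE_BASE="/path/to/a/directory"'
# append_once ~/.bash_profile 'export PATH=$PATH:$CENTRIFUGE_BASE/centrifuge'
```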

Now you have a working centrifuge installation, but no databases to do anything with. For that, you will need to download pre-computed indexes (unless you want to go full Voldemort and compile your own indexes). The compressed indexes for Bacteria, Viruses, and the Human genome are 6.3 Gb, and they will take about 9 Gb on your disk uncompressed. You will download and unpack this data only once:

$ cd $CENTRIFUGE_BASE
$ wget ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p+h+v.tar.gz
$ tar -zxvf p+h+v.tar.gz && rm -rf p+h+v.tar.gz
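Because the compressed archive and the unpacked indexes sit on disk together until the rm step runs, it may be worth confirming beforehand that roughly 16 Gb are free (6.3 plus about 9, rounded up). The sketch below does that with df; free_gb is a hypothetical helper and the 16 Gb threshold is an estimate drawn from the numbers above, not an official requirement.

```shell
# Print the free space, in whole gigabytes, on the filesystem holding DIR.
free_gb() {
    # -P gives one line per filesystem, -k reports 1024-byte blocks;
    # column 4 is the available space
    df -Pk "$1" | awk 'NR==2 { print int($4 / 1024 / 1024) }'
}

needed_gb=16
target="${CENTRIFUGE_BASE:-.}"   # falls back to the current directory

if [ "$(free_gb "$target")" -ge "$needed_gb" ]; then
    echo "Enough room in $target for the download"
else
    echo "Less than ${needed_gb} Gb free in $target" >&2
fi
```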

If everything went alright, you should see something similar to this when you run the following command:

$ ls -lh $CENTRIFUGE_BASE/p+h+v/*cf
-rw-r--r--   6.5G Feb 15 13:18 $CENTRIFUGE_BASE/p+h+v/p+h+v.1.cf
-rw-r--r--   2.3G Feb 15 13:18 $CENTRIFUGE_BASE/p+h+v/p+h+v.2.cf
-rw-r--r--   1.4M Feb 15 13:18 $CENTRIFUGE_BASE/p+h+v/p+h+v.3.cf

Good? Good! See? You are totally doing this!

MCL

MCL is “a fast and scalable unsupervised cluster algorithm for graphs based on simulation of (stochastic) flow in graphs”, developed by Stijn van Dongen. If you see mcl 14-137 as the output when you type mcl --version in your terminal, you are golden. Otherwise you can install it the following way:

 $ wget http://www.micans.org/mcl/src/mcl-14-137.tar.gz
 $ tar -zxvf mcl-14-137.tar.gz && cd mcl-14-137
 $ ./configure && make && sudo make install

Once you are done, you should get a simple usage statement instead of a command not found error when you type mcl in your terminal. If that is the case, you are done.

eggnog-mapper

eggnog-mapper is a tool for fast functional annotation of novel sequences (genes or proteins) using precomputed eggNOG-based orthology assignments.

Citation: http://biorxiv.org/content/early/2016/09/22/076331

The official codebase for eggnog-mapper is here, and a pre-print by Jaime Huerta-Cepas and his colleagues describing the work is here. If you follow this recipe, you should remember that you will be using eggNOG databases with eggnog-mapper, and in your writings you should cite the eggNOG release, too:

Citation: https://www.ncbi.nlm.nih.gov/pubmed/26582926

eggnog-mapper has online documentation for you to read, set it up on your system yourself, and learn about the details of working with it. This is a recipe for the lazy. If you have a systems administrator, it may be better for them to set it up as a module for everyone. Otherwise, this recipe will tell you how you can do it within your own space (note that you will need lots of disk space depending on the databases you want to download).

To install eggnog-mapper you first need to get the source code, and then you will need to collect the precomputed database files.

First, you need to decide where you want to put eggnog-mapper and its databases (you will need to change that /path/to/a/directory part to wherever you want on your disk):

$ export EGGNOG_MAPPER_BASE="/path/to/a/directory"

Here is how you get the code:

$ cd $EGGNOG_MAPPER_BASE
$ git clone https://github.com/jhcepas/eggnog-mapper.git
$ cd eggnog-mapper/
$ git checkout tags/0.12.6
$ export PATH=$PATH:$EGGNOG_MAPPER_BASE/eggnog-mapper

At this point if you run this command, you should get the following output:

$ emapper.py --version
emapper-0.12.6

If all is good, now you can download the databases. Which databases you are going to be downloading is up to you (which will not only affect the disk space you need, but also the runtime to screen your genes). Here I will download everything (because I have time and space):

$ download_eggnog_data.py euk bact arch viruses -y

This will take a very long time, mostly due to the large I/O overhead of decompressing databases that contain large numbers of small files (so do not forget to start the process in a screen), but fortunately you will not have to do it again.

If you are here, you have the basic setup done. Congratulations.

As a very final step, you should add these two lines in your ~/.bashrc or ~/.bash_profile file (whichever one is being used on your system, most likely ~/.bash_profile will work) to make sure they are set in your environment every time you start a new terminal (don’t forget to update the directory name):

export EGGNOG_MAPPER_BASE="/path/to/a/directory"
export PATH=$PATH:$EGGNOG_MAPPER_BASE/eggnog-mapper

muscle

muscle is a widely used multiple sequence alignment program.

Citation: https://www.ncbi.nlm.nih.gov/pubmed/15034147

Anvi’o uses muscle to align amino acid sequences within each protein cluster while running the pangenomic workflow in v2.1.0 and later versions of anvi’o. Installation is rather easy: go to the downloads page for muscle, grab the one that matches your operating system, rename the unzipped binary to ‘muscle’, and move it into /usr/local/bin or whichever directory in your PATH works for you.

If you were successful, this is what you should see when you type muscle -version in your terminal:

$ muscle -version
MUSCLE v3.8.31 by Robert C. Edgar

FAMSA

FAMSA is a fast and accurate multiple sequence alignment program for protein sequences.

Citation: https://www.nature.com/articles/srep33964

Use the following command to download the proper version of FAMSA on your Linux operating system,

wget https://github.com/refresh-bio/FAMSA/releases/download/v1.2.1/famsa-1.2.1-linux -O famsa

Or on your Mac OSX computer,

wget https://github.com/refresh-bio/FAMSA/releases/download/v1.2.1/famsa-1.2.1-osx -O famsa

and run the following two commands:

chmod +x famsa
sudo mv famsa /usr/local/bin/famsa

You are golden if you are seeing this when you type famsa in your terminal:

$ famsa
FAMSA (Fast and Accurate Multiple Sequence Alignment) ver. 1.2 CPU
  by S. Deorowicz, A. Debudaj-Grabysz, A. Gudys (2017-02-05)

(...)
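Once you have worked through the recipes, a quick way to see what is still missing is to ask the shell where each program lives. The sketch below loops over the command names used in this article; report_tools is a hypothetical helper, and you can trim the list to the tools you actually need.

```shell
# Print, for each command name given, where it was found in PATH (if at all).
report_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%-12s found at %s\n' "$tool" "$(command -v "$tool")"
        else
            printf '%-12s NOT FOUND\n' "$tool"
        fi
    done
}

report_tools samtools prodigal hmmscan sqlite3 FastTree centrifuge mcl emapper.py muscle famsa
```

Anything reported as NOT FOUND is either not installed or sits in a directory that is not in your PATH.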