This article describes some recipes to install software written by other groups.
Most of the software we rely on to enhance anvi’o’s abilities does allow us to redistribute its code or ship it pre-installed; however, we do not want to follow that route. Although doing so would have made your life much easier, bundling third-party software directly within other platforms keeps users from appreciating other groups’ efforts.
As an apology, we will do our best to keep this article up-to-date, so installing third-party software anvi’o uses will not be a big hassle for you. Thank you for your understanding, and your patience in advance.
We make a lot of typos, sometimes parameters or versions slightly change, and we fail to keep tutorials up-to-date all the time. If you find a mistake on this page, or if you would like to change something in it, you can directly edit its source code by clicking the “Edit this file” icon in the top-right corner (which you will see if you have logged in to GitHub), and send us a ‘pull request’. We will be very thankful.
samtools is a high-performance program to manipulate SAM and BAM files.
Go to your terminal, and type samtools --version. If you get an error, you need to install it; if the version number is lower than 1.3.1, you probably need to update it.
You can update samtools and/or install it on your system the following way:
wget https://github.com/samtools/samtools/releases/download/1.3.1/samtools-1.3.1.tar.bz2
tar -jxvf samtools-1.3.1.tar.bz2 && cd samtools-1.3.1
make && sudo make install
Don’t forget to type
samtools --version again to confirm that it is all good!
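The version check above can also be scripted. What follows is a minimal sketch in POSIX shell, not something that ships with samtools or anvi’o: the function name version_at_least is made up for illustration, and it assumes purely numeric, dot-separated versions such as 1.3.1 (a suffix like the b2 in 3.1b2 would need extra handling).

```shell
# version_at_least INSTALLED REQUIRED
# Succeeds (exit status 0) when INSTALLED is greater than or equal to REQUIRED.
# Hypothetical helper; handles only digits separated by dots.
version_at_least() {
    have=$1
    need=$2
    while [ -n "$have" ] || [ -n "$need" ]; do
        # Take the first dot-separated field of each version.
        h=${have%%.*}
        n=${need%%.*}
        # Drop that field; an exhausted version becomes empty.
        case $have in *.*) have=${have#*.} ;; *) have= ;; esac
        case $need in *.*) need=${need#*.} ;; *) need= ;; esac
        # A missing field counts as 0, so 1.3 is treated as 1.3.0.
        h=${h:-0}
        n=${n:-0}
        if [ "$h" -gt "$n" ]; then return 0; fi
        if [ "$h" -lt "$n" ]; then return 1; fi
    done
    return 0
}

# Example, assuming the first line of `samtools --version` looks like "samtools 1.3.1":
#   installed=$(samtools --version | head -n 1 | awk '{print $2}')
#   version_at_least "$installed" 1.3.1 || echo "please update samtools"
```

Comparing field by field rather than as plain strings matters because a string comparison would rank 1.10 below 1.3.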
Prodigal is a bacterial and archaeal gene finding program developed at Oak Ridge National Laboratory and the University of Tennessee. Every time you create a contigs database in anvi’o with anvi-gen-contigs-database, you use it.
Go to your terminal, and type prodigal -v. If you get an error, you need to install it; if the version number is lower than 2.6.2, you need to update it.
Here is how to install v2.6.2 (the first line will not work if you don’t have wget, but you can easily get wget installed by typing sudo port install wget if you are using the MacPorts system on your Mac computer):
wget https://github.com/hyattpd/Prodigal/archive/v2.6.2.tar.gz
tar -zxvf v2.6.2.tar.gz && cd Prodigal-2.6.2/ && make
sudo cp prodigal /usr/local/bin/
Type prodigal -v again to make sure everything is alright, and you get the proper version number.
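Since the recipes on this page lean on wget, here is a hedged sketch of a fallback for machines that only have curl. The function names pick_downloader and fetch are hypothetical and exist only for this example; curl’s -L (follow redirects) and -O (keep the remote file name) are standard options.

```shell
# pick_downloader: print the name of the first download tool found in PATH,
# or fail when neither wget nor curl is available.
pick_downloader() {
    if command -v wget >/dev/null 2>&1; then
        echo wget
    elif command -v curl >/dev/null 2>&1; then
        echo curl
    else
        return 1
    fi
}

# fetch URL: download URL into the current directory.
fetch() {
    case "$(pick_downloader)" in
        wget) wget "$1" ;;
        curl) curl -L -O "$1" ;;
        *) echo "neither wget nor curl was found" >&2; return 1 ;;
    esac
}

# Example:
#   fetch https://github.com/hyattpd/Prodigal/archive/v2.6.2.tar.gz
```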
HMMER uses hidden Markov models to perform sequence search and alignments. Every time you run the anvi-run-hmms program, you use it.
Go to your terminal, and type hmmscan -h. If you get an error, you need to install HMMER; if the version number is lower than 3.1, you need to update it.
Here is how to install v3.1b2:
wget http://eddylab.org/software/hmmer3/3.1b2/hmmer-3.1b2.tar.gz
tar -zxvf hmmer-3.1b2.tar.gz
cd hmmer-3.1b2
./configure && make && sudo make install
cd easel && make check && sudo make install
Type hmmscan -h again to make sure everything is alright, and you get the proper version number.
SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. Anvi’o uses SQLite pretty much all the time.
Go to your terminal, and type sqlite3 --version. If you get an error, you need to install it. Extensive installation instructions are here. Or you can install it by typing sudo port install sqlite3 if you are using the port system on a Mac OSX computer.
Note: Although this is completely optional, you may also want to consider installing DB Browser for SQLite. It is a lightweight, open-source database browser with a nice graphical interface that is very easy to install. You probably will never need it or use it, but it may be handy at some point.
GNU Scientific Library
GSL is a widely used C library for scientific computation. The only thing that depends on GSL is the CONCOCT extension in the codebase. The installation is quite straightforward on most systems. If you are using MacPorts, you can type this in your terminal:
port install gsl gsl-devel py27-gsl (Rika tells me homebrew on Mac works, too). Otherwise, try these commands and you should be OK:
wget ftp://ftp.gnu.org/gnu/gsl/gsl-latest.tar.gz
tar -zxvf gsl-latest.tar.gz
# the trailing slash makes the pattern match the extracted directory only, not the tarball
cd gsl-*/
./configure && make && sudo make install
NumPy is the fundamental package for scientific computing with Python. Anvi’o uses numpy quite often, and probably not in the best way possible.
You don’t need to install numpy if you get no complaints back when you type
python -c "import numpy" in your terminal. If you do get an import error, then you need to install numpy. You can try this:
sudo pip install numpy
Cython is “an optimising static compiler for both the Python programming language and the extended Cython programming language”. If
python -c "import Cython" in your terminal does not complain, you are golden. Otherwise, install it by running this:
sudo pip install Cython
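The two import checks above follow the same pattern, so they can be folded into a small loop. This is only a sketch: has_python_module is a hypothetical helper name, and it assumes the python in your PATH is the interpreter you plan to use.

```shell
# has_python_module NAME: succeed when `python -c "import NAME"` runs cleanly.
has_python_module() {
    python -c "import $1" >/dev/null 2>&1
}

# Report on the two modules discussed above.
for module in numpy Cython; do
    if has_python_module "$module"; then
        echo "$module: OK"
    else
        echo "$module: missing (try: sudo pip install $module)"
    fi
done
```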
FastTree infers approximately-maximum-likelihood phylogenetic trees from alignments of nucleotide or protein sequences. To install FastTree, first visit this page and follow the instructions to compile it:
If you feel lazy, you can try these commands for a quick installation, too:
wget http://www.microbesonline.org/fasttree/FastTree.c
gcc -DNO_SSE -O3 -finline-functions -funroll-loops -Wall -o FastTree FastTree.c -lm
Regardless of how you compiled it, run this command to make sure it is in your PATH:
sudo mv FastTree /usr/local/bin
If everything is OK, this is the output you should see when you run FastTree on your system:
$ FastTree
Usage for FastTree version 2.1.10 No SSE3:
  FastTree protein_alignment > tree
  FastTree < protein_alignment > tree
  FastTree -out tree protein_alignment
  FastTree -nt nucleotide_alignment > tree
  FastTree -nt -gtr < nucleotide_alignment > tree
  FastTree < nucleotide_alignment > tree
FastTree accepts alignments in fasta or phylip interleaved formats
(...)
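Whether a program can be reached through your PATH can be checked for several tools in one go. The sketch below uses the standard command -v lookup; the helper name tool_status is made up for this example, and the list holds the command names used earlier on this page.

```shell
# tool_status NAME: print "NAME: found" or "NAME: missing" depending on
# whether the shell can locate NAME.
tool_status() {
    if command -v "$1" >/dev/null 2>&1; then
        printf '%s: found\n' "$1"
    else
        printf '%s: missing\n' "$1"
    fi
}

for tool in samtools prodigal hmmscan sqlite3 FastTree; do
    tool_status "$tool"
done
```

This only tells you that a program exists, not which version it is, so the version checks described for each tool still apply.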
HDF5 is “a data model, library, and file format for storing and managing data”. If you are not sure what it is, you probably don’t have it, but we are big fans of HDF5 here on the anvi’o development side. If you are using MacPorts on your Mac, you can get away with sudo port install hdf5; otherwise you can run these commands in your terminal (these are for version 1.8.17, feel free to check whether there is a newer release of HDF5 from here, and install the most current tar.bz2 file in that directory instead):
wget https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.8/hdf5-1.8.17/src/hdf5-1.8.17.tar.bz2
tar -jxvf hdf5-1.8.17.tar.bz2
cd hdf5-1.8.17
./configure && make && sudo make install
Depending on your operating system and version, you may need to install the libhdf5-dev package separately to avoid fatal No such file or directory errors for various header files (we heard complaints from Debian and Ubuntu users).
Centrifuge is a “classification engine that enables rapid, accurate and sensitive labeling of reads and quantification of species on desktop computers”.
To install centrifuge, you first need to decide where you want to put all its files on your disk. It could be a directory under /usr/local, or somewhere under your user directory in case you don’t have superuser access on the machine you are working on. Once you know where, open a terminal and set an environment variable that points to the base directory in which you want to keep all centrifuge files:
$ export CENTRIFUGE_BASE="/path/to/a/directory"
Do not forget to make sure your version of /path/to/a/directory is a full path, and starts with a / character.
More on the “full path” thingy: Let’s say I want to put all centrifuge related stuff in a directory called CENTRIFUGE in my home. Here is what I do: First, in my terminal I type cd to make sure I am in my home directory. Then I type mkdir -p CENTRIFUGE to make sure the directory CENTRIFUGE exists in my home. Then I type cd CENTRIFUGE to go into it. Finally I type pwd to get the full path, and replace /path/to/a/directory in the command above with that entire string (still keeping it in double quotes) before running the export command.
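The mkdir, cd, and pwd steps described above can be sketched as a single helper that creates a directory and prints its full path. The name absolute_dir is hypothetical, and the CENTRIFUGE directory in the usage comment is just the example location from the text.

```shell
# absolute_dir DIR: create DIR if it does not exist yet, then print its full path.
# The cd runs in a subshell, so your current directory does not change.
absolute_dir() {
    mkdir -p "$1" && (CDPATH= cd -- "$1" && pwd)
}

# Example, run from anywhere:
#   CENTRIFUGE_BASE=$(absolute_dir "$HOME/CENTRIFUGE")
#   export CENTRIFUGE_BASE
#   echo "$CENTRIFUGE_BASE"    # should start with a /
```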
Then you will get the code, and compile it:
cd $CENTRIFUGE_BASE
git clone https://github.com/infphilo/centrifuge
cd centrifuge
git checkout 30e3f06ec35bc83e430b49a052f551a1e3edef42
make
This compiles everything, but does not install anything. To make sure binary files are available directly, you can run this:
$ export PATH=$PATH:$CENTRIFUGE_BASE/centrifuge
If everything is alright so far, this is what you should see if you run the following command:
$ centrifuge --version | head -n 1
centrifuge-class version v1.0.1-beta-27-g30e3f06ec3
Good? Good. If it does not work, it means you made a mistake with your path variables. If it worked, it means you are golden, and now you should add the following two lines to your ~/.bashrc or ~/.bash_profile file (whichever one is being used on your system, most likely ~/.bash_profile will work) to make sure they are set in your environment every time you start a new terminal (clearly with the right full path):
export CENTRIFUGE_BASE="/path/to/a/directory"
export PATH=$PATH:$CENTRIFUGE_BASE/centrifuge
You can test whether you managed to do this right by opening a new terminal, and typing
centrifuge --version. Did it work? Good. Then you set your environment variables right.
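If you would rather add those lines from the command line than open an editor, the sketch below appends a line to a file only when an identical line is not already there, so running it twice does no harm. The helper name append_once is an assumption made for this example; grep’s -q, -x, and -F options (quiet, whole-line, fixed-string matching) are standard.

```shell
# append_once FILE LINE: add LINE to the end of FILE unless FILE already
# contains exactly that line. FILE is created if it does not exist.
append_once() {
    if ! grep -qxF -e "$2" "$1" 2>/dev/null; then
        printf '%s\n' "$2" >> "$1"
    fi
}

# Example (single quotes keep $PATH and $CENTRIFUGE_BASE unexpanded in the file):
#   append_once ~/.bash_profile 'export CENTRIFUGE_BASE="/path/to/a/directory"'
#   append_once ~/.bash_profile 'export PATH=$PATH:$CENTRIFUGE_BASE/centrifuge'
```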
Now you have a working centrifuge installation, but no databases to do anything with. For that, you will need to download pre-computed indexes (unless you want to go full Voldemort and compile your own indexes). The compressed indexes for bacteria, viruses, and the human genome are 6.3 Gb in total, and they will take about 9 Gb on your disk uncompressed. You will download this data and unpack it only once:
$ cd $CENTRIFUGE_BASE
$ wget ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p+h+v.tar.gz
$ tar -zxvf p+h+v.tar.gz && rm -rf p+h+v.tar.gz
If everything went alright, you should see something similar to this when you run the following command:
$ ls -lh $CENTRIFUGE_BASE/p+h+v/*cf
-rw-r--r-- 6.5G Feb 15 13:18 $CENTRIFUGE_BASE/p+h+v/p+h+v.1.cf
-rw-r--r-- 2.3G Feb 15 13:18 $CENTRIFUGE_BASE/p+h+v/p+h+v.2.cf
-rw-r--r-- 1.4M Feb 15 13:18 $CENTRIFUGE_BASE/p+h+v/p+h+v.3.cf
Good? Good! See? You are totally doing this!
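The same check can be done without reading the listing by eye. This is a minimal sketch, assuming the three file names shown in the listing above; the function name centrifuge_index_ok is hypothetical. It only confirms that the files exist and are not empty, not that the download was complete.

```shell
# centrifuge_index_ok BASE: succeed when p+h+v.1.cf, p+h+v.2.cf and p+h+v.3.cf
# exist under BASE/p+h+v and are non-empty.
centrifuge_index_ok() {
    for part in 1 2 3; do
        if [ ! -s "$1/p+h+v/p+h+v.$part.cf" ]; then
            echo "missing or empty: $1/p+h+v/p+h+v.$part.cf" >&2
            return 1
        fi
    done
    return 0
}

# Example:
#   centrifuge_index_ok "$CENTRIFUGE_BASE" && echo "indexes look fine"
```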
MCL is “a fast and scalable unsupervised cluster algorithm for graphs based on simulation of (stochastic) flow in graphs”, developed by Stijn van Dongen. If you see mcl 14-137 in the output when you type mcl --version in your terminal, you are golden. Otherwise you can install it the following way:
$ wget http://www.micans.org/mcl/src/mcl-14-137.tar.gz
$ tar -zxvf mcl-14-137.tar.gz && cd mcl-14-137
$ ./configure && make && sudo make install
Once you are done, you should get a simple usage statement instead of a command not found error when you type
mcl in your terminal. If that is the case, you are done.
eggnog-mapper is a tool for fast functional annotation of novel sequences (genes or proteins) using precomputed eggNOG-based orthology assignments.
The official codebase for
eggnog-mapper is here, and a pre-print by Jaime Huerta-Cepas and his colleagues describing the work is here. If you follow this recipe, you should remember that you will be using
eggNOG databases with
eggnog-mapper, and in your writings you should cite the
eggNOG release, too:
eggnog-mapper has online documentation for you to read and set it up on your system yourself, and learn about the details of working with it. This is a recipe for the lazy. If you have a systems administrator, it may be better for them to set it up as a module for everyone. Otherwise, this recipe will tell you how you can do it within your own space (note that you will need lots of disk space depending on the databases you want to download).
To install eggnog-mapper, you first need to get the source code, and then you will need to collect the precomputed database files.
First, you need to decide where you want to put eggnog-mapper and its databases (you will need to change that /path/to/a/directory line to wherever you want on your disk):
$ export EGGNOG_MAPPER_BASE="/path/to/a/directory"
Here is how you get the code:
$ cd $EGGNOG_MAPPER_BASE
$ git clone https://github.com/jhcepas/eggnog-mapper.git
$ cd eggnog-mapper/
$ git checkout tags/0.12.6
$ export PATH=$PATH:$EGGNOG_MAPPER_BASE/eggnog-mapper
At this point if you run this command, you should get the following output:
$ emapper.py --version
emapper-0.12.6
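If the shell reports that the command is not found instead, the PATH entry is a likely culprit. Below is a small sketch for checking whether a directory is listed in PATH; the helper name path_contains is an assumption for this example, and it compares entries literally, so a trailing slash or a relative path will not match.

```shell
# path_contains DIR [PATH_STRING]: succeed when DIR is one of the
# colon-separated entries of PATH_STRING (defaults to the current $PATH).
path_contains() {
    case ":${2-$PATH}:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   path_contains "$EGGNOG_MAPPER_BASE/eggnog-mapper" || echo "not in PATH yet"
```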
If all is good, now you can download the databases. Which databases you are going to be downloading is up to you (which will not only affect the disk space you need, but also the runtime to screen your genes). Here I will download everything (because I have time and space):
$ download_eggnog_data.py euk bact arch viruses -y
This will take a very, very long time, mostly due to the large I/O overhead of decompressing some of the databases with large numbers of smaller files (so do not forget to start the process in a screen), but fortunately you will not have to do it again.
If you are here, you have the basic setup done. Congratulations.
As a very final step, you should add these two lines to your ~/.bashrc or ~/.bash_profile file (whichever one is being used on your system, most likely ~/.bash_profile will work) to make sure they are set in your environment every time you start a new terminal (don’t forget to update the directory name):
export EGGNOG_MAPPER_BASE="/path/to/a/directory"
export PATH=$PATH:$EGGNOG_MAPPER_BASE/eggnog-mapper
muscle is one of the best-performing multiple alignment programs.
Anvi’o uses muscle to align amino acid sequences within each protein cluster while running the pangenomic workflow in v2.1.0 and later versions of anvi’o. Installation is rather easy: go to the downloads page for muscle, grab the one that matches your operating system, rename the unzipped binary to ‘muscle’, and move it into /usr/local/bin or any other directory that is in your PATH.
If you were successful, this is what you should see when you type muscle -version in your terminal:
$ muscle -version
MUSCLE v3.8.31 by Robert C. Edgar