This script takes the 'metadata' output of the program ncbi-genome-download (see https://github.com/kblin/ncbi-genome-download for details), and processes each GenBank file listed in that metadata file to generate a FASTA file, as well as gene calls and functions files, for each entry. It also automatically generates a fasta-txt file descriptor for anvi'o snakemake workflows. So it is a multi-talented program like that.
Suppose you have downloaded some genomes from NCBI (using the incredibly useful ncbi-genome-download) and you have a metadata table describing those genomes. This program will convert that metadata table into some useful files, namely: a FASTA file of contig sequences, an external gene calls file, and an external functions file for each genome you have downloaded; as well as a single tab-delimited fasta-txt file describing the path to each of these files for all downloaded genomes (which you can pass directly to a snakemake workflow if you need to). Yay.
The metadata file
The prerequisite for running this program is to have a tab-delimited metadata file containing information about each of the genomes you downloaded from NCBI. Let’s say your download command started like this:
ncbi-genome-download --metadata-table ncbi_metadata.txt -t ....
So for the purposes of this usage tutorial, your metadata file is called ncbi_metadata.txt.
In case you are wondering, that file should have a header that looks something like this:
assembly_accession bioproject biosample wgs_master excluded_from_refseq refseq_category relation_to_type_material taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name submitter gbrs_paired_asm paired_asm_comp ftp_path local_filename
If you run the program with only the metadata file, as in the command below, all the output files will show up in your current working directory.
anvi-script-process-genbank-metadata -m ncbi_metadata.txt
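If a run like this trips over the metadata file, a quick look at its header can help. The sketch below is only an illustration and not part of anvi'o: it writes a small stand-in metadata file with a shortened header (so the example runs on its own), prints the column names one per line, and checks for a few of them. The three columns it checks are an arbitrary choice for demonstration, not a statement about which columns the program actually requires.

```shell
# Stand-in metadata file with a shortened header, so this sketch runs on its own.
# With real data, set METADATA to your own file (e.g. ncbi_metadata.txt) instead.
workdir=$(mktemp -d)
METADATA="$workdir/ncbi_metadata.txt"
printf 'assembly_accession\tbioproject\tbiosample\torganism_name\tlocal_filename\n' > "$METADATA"

# Split the tab-delimited header into one column name per line.
columns=$(head -n 1 "$METADATA" | tr '\t' '\n')
printf '%s\n' "$columns"

# Collect any of the (arbitrarily chosen) columns that are absent from the header.
missing=""
for col in assembly_accession organism_name local_filename; do
    if ! printf '%s\n' "$columns" | grep -qx "$col"; then
        missing="$missing $col"
    fi
done

if [ -n "$missing" ]; then
    echo "Missing columns:$missing"
else
    echo "All checked columns are present."
fi
```

Listing the header one name per line also makes fused or misspelled column names easy to spot.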
Choosing an output directory
Alternatively, you can specify a directory in which to generate the output:
anvi-script-process-genbank-metadata -m ncbi_metadata.txt -o DOWNLOADED_GENOMES
Picking a name for the fasta-txt file
The default name for the fasta-txt file is fasta-input.txt, but you can change that with the --output-fasta-txt parameter:
anvi-script-process-genbank-metadata -m ncbi_metadata.txt --output-fasta-txt ncbi_fasta.txt
Make a fasta-txt without the gene calls and functions columns
The default columns in the fasta-txt file are:
name path external_gene_calls gene_functional_annotation
But sometimes, you don’t want your downstream snakemake workflow to use those external gene calls or functional annotations files. So to skip adding those columns to the fasta-txt file, you can use the -E flag:
anvi-script-process-genbank-metadata -m ncbi_metadata.txt --output-fasta-txt ncbi_fasta.txt -E
Then the fasta-txt will only contain a name column and a path column.
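Before handing a fasta-txt file to a workflow, it can be worth confirming that every listed FASTA path exists. The following is a minimal sketch under stated assumptions, not something anvi'o ships: it builds a stand-in two-column fasta-txt with hypothetical genome names and file names, then walks the file and reports entries whose path is missing. Because any extra fields end up in the `rest` variable, the same loop should also work on the default four-column layout.

```shell
# Stand-in fasta-txt with only the name and path columns, so this sketch runs on its own.
# The genome names and file names here are hypothetical.
workdir=$(mktemp -d)
FASTA_TXT="$workdir/fasta-input.txt"
: > "$workdir/genome_A.fa"                                       # an existing (empty) file
printf 'name\tpath\n' > "$FASTA_TXT"
printf 'genome_A\t%s\n' "$workdir/genome_A.fa" >> "$FASTA_TXT"
printf 'genome_B\t%s\n' "$workdir/genome_B.fa" >> "$FASTA_TXT"   # deliberately missing

tab=$(printf '\t')
checked=0
missing_count=0
{
    read -r header    # skip the header line
    # Split each remaining line on tabs: first field is the name, second the path.
    while IFS="$tab" read -r name path rest; do
        checked=$((checked + 1))
        if [ ! -f "$path" ]; then
            echo "Missing FASTA for $name: $path"
            missing_count=$((missing_count + 1))
        fi
    done
} < "$FASTA_TXT"

echo "$checked entries checked, $missing_count missing"
```

With your own data, point `FASTA_TXT` at the file the program wrote (fasta-input.txt by default) and drop the lines that create the stand-in files.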