Microbial 'omics


Brought to you by

a post by Evan Kiefl

Table of Contents

This feature is implemented in anvi’o v7 and later. You can identify which version you have on your computer by typing anvi-self-test --version in your terminal.

The latest version of anvi’o is v6.2. See the release notes.

Who is this post for?

NOTE: For a more practical tutorial on the same topic, please visit the infant gut tutorial

This post is like a technical diary for my implementation of InteracDome into anvi’o. The primary purpose of this is to provide all the technical details in one place, so that (1) people who inevitably use this in anvi’o know what is going on under the hood, and (2) so people digging into the codebase for debugging or extending features understand why I made certain decisons. It is thus a very technical blog post, and in that sense is quite distinct from other stuff we write about. But if you think you may be interested, please keep reading. I will not pretend that this has a start middle and end–it is more a reference for all the decisions I made. Just keep in mind that I will not be shying away from sharing code snippets from the codebase, so if you’re code-shy, this is your trigger warning.

This post is not version-controlled. The codebase is dynamic and will inevitably change, but this post will (probably) not, and is therefore merely a snapshot of what once was (July 20th, 2020). I include it in the hopes it provides conceptual clarity, and hope it is not taken too seriously, or an accurate representation of anvi’o’s codebase.

Introduction

Far too often we do not know what our SNV, SAAV, and SCV data mean. The problem is that our data output is nucleotide sequence, which defines the blueprint for function, but is a very abstracted output from what we are trying to learn about. As eloquently put by Harms and Thornton, as a practical convenience, we as evolutionary biologists have a tendency to “treat molecular sequences as mere strings of letters, the patterns of which carry the traces of historical processes, rather than as functioning objects for which the physical properties determine their behaviour”. It is from these patterns that we can learn so much by analyzing variant data derived from metagenomes. However, given that the SNV/SAAV/SCV (I’ll just call them variants from now on) patterns by themselves offer zero insight into fitness, within our field, variant data is most commonly used to identify and track “strains”. And to be honest, that saddens me because there is so much more potential than using them as ecological markers.

To bridge a gap between metagenomic sequence variants and the structural biology of gene products that underpin function and fitness is no easy task, but I think its critical if we want empower our findings with biochemical information. (As a small step in this direction, Özcan, Meren, and I developed anvi-3dev, a way to visualize metagenomic sequence variants directly on predicted protein structures).

Related to this concept, one of the most transformative talks I have ever attended was given by Dr. Mona Singh about this paper (Kobren and Singh, 2018). Kobren and Singh took every Pfam family and searched for whether or not any members of the family had crystalized structures that co-complexed with bound ligands. If so, then they calculated 3D distance scores of each residue to the ligand(s) and used these distances as an inverse proxy for binding likelihood (the more likely you are to be involved in the binding of a ligand, the closer you are in physical space to the ligand). They aggregated the totality of these results and ended up with thousands of Pfam IDs for which they could attribute per-residue ligand-binding scores to the Pfam’s associated hidden Markov model (HMM) profile. They used this to extend our knowledge of the human proteome, however when I was listening to the talk all I was thinking about was applying it to metagenomics.

They entitled their software, InteracDome, and there is an online server where you can give an amino acid sequence, and their server will run an HMM of your sequence against the InteracDome database and attribute estimated binding scores for each residue in your sequence. That is basically exactly what I wanted to do, except I wanted to extend this to the potentially massive number of genes that can typically found in anvi’o contigs databases. Furthermore, I wanted to store all the information in the contigs database for further data analysis and visualization, as is the anvi’o way.

Breaking it down into (6) components

Here are the bird’s-eye-view components of InteracDome’s implementation into anvi’o:

  1. Storing a local copy of InteracDome’s tab-separated files
  2. Storing the HMM profiles that the contig database’s genes are searched against
  3. Running the HMM
  4. Parsing HMMER’s standard output file
  5. Filtering HMM hits
  6. Matching binding frequencies to the gene’s residues
  7. Storing the per-residue binding frequencies into the contigs database

(1) and (2): Storing copies of the InteracDome datasets and the corresponding HMM profiles

Description of tab-separated InteracDome files

What are these tab-separated files, you say? Well they contain the binding frequencies associated to each match-state (read as: residue) of the HMM. Here is what one looks like (be sure to scroll right to get a complete picture of the table):

pfam_id domain_length ligand_type num_nonidentical_instances num_structures binding_frequencies
PF00001_7tm_1 268 1WV 3 4 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.460539215686,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.539460784314,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.360294117647,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
PF00001_7tm_1 268 2CV 7 10 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.103823009901,0,0,0.103823009901,0,0,0,0,0,0.207401960435,0,0,0,0.155734514852,0,0,0.577037641019,0,0,0.572218731277,0.0519115049506,0,0.310492792233,0.473458690485,0,0,0,0,0,0.155490455484,0,0,0,0,0,0.109873556499,0,0,0.051667445583,0,0.0519115049506,0,0,0,0.574814741526,0.371783032133,0,0,0,0,0,0,0,0,0,0,0,0,0,0.582709105685,0.681678668828,0.112286030157,0,0,0,0,0,0,0,0.0517419453701,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0519115049506,0.207646019802,0,0,0,0,0,0,0.0519115049506,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0519115049506,0,0.103823009901,0.155734514852,0,0,0.0519115049506,0.0519115049506,0,0,0,0,0,0,0,0,0,0,0.0519115049506,0,0.051667445583,0,0,0,0.0544330154125,0,0,0,0,0,0.051667445583,0,0,0,0,0,0,0
PF00001_7tm_1 268 ERM 6 5 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.0,0.123851117406,0,0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.123851117406,0.258781869688,0.134930752282,0.125645262827,0.125645262827,0,0,0.490352533837,0.296797293044,0,0.125220333648,0.134930752282,0,0,0,0,0,0,0.134930752282,0,0.31920050362,0,0,0.296797293044,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.296797293044,0,0,0.490352533837,0,0,0,0.490352533837,0.123851117406,0.249071451054,0.318775574441,0.193555240793,0,0.125220333648,0,0.249496380233,0,0,0,0.125645262827,0,0,0,0,0.431728045326,0.296797293044,0,0.123851117406,0.422017626692,0,0,0.249071451054,0,0,0.123851117406,0.125220333648,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
PF00003_7tm_3 238 SM_ 3 4 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.248912097476,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.248912097476,0,0,1.0,1.0,0,0,0.498259355962,0.751087902524,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.248912097476,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.249347258486,0.249347258486,0.497824194952,0,0.249347258486,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.249347258486,0,0.497824194952,0,0.249347258486
PF00003_7tm_3 238 ALL_ 3 4 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.248912097476,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.248912097476,0,0,1.0,1.0,0,0,0.498259355962,0.751087902524,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.248912097476,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.249347258486,0.249347258486,0.497824194952,0,0.249347258486,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.249347258486,0,0.497824194952,0,0.249347258486
PF00004_AAA 132 SM_ 205 166 0.0114604627231,0.00477140578191,0.00159046859397,0.00573253589269,0.00636187437588,0.0156074891605,0.151915630049,0.00636187437588,0.165530516241,0.00465063703568,0.7665876828,0.782039254129,0.638852101367,0.0148199070736,0.0197556615281,0.0659386849446,0.0233671491129,0.0274992668572,0.0068758651402,0.0243892439834,0.016548822645,0.0175596896304,0.0235691023598,0.0148281476804,0.0108389692647,0.0200299870214,0.0158208194559,0.136698750907,0.0183922951312,0.0108666290365,0.00319897110896,0.00352559062884,0.00340017637127,0.00310366737855,0.00743451388943,0,0,0.00226506315454,0.00226506315454,0.00272300017124,0,0.00159046859397,0.00435666305088,0.00833301017716,0.00674254158319,0.00318093718794,0.00871332610176,0.00871332610176,0.00594713164485,0.00636187437588,0.00477140578191,0.00636187437588,0.00159046859397,0.00159046859397,0,0,0,0,0.0166834633166,0.0124614217996,0.0123052272426,0.00332776801496,0,0,0.0235604250871,0.0205062852754,0.00159046859397,0.00159046859397,0.00443759986205,0.0108022602052,0,0.00758910697789,0.042872742652,0,0,0,0.0205374756636,0,0,0,0.00346838508927,0.00216787556951,0.00390892326059,0.00781784652118,0.00604666604827,0.00768955958345,0.0109017146314,0.012587618788,0.00954249832977,0.0143484218372,0.00116949287805,0.0021871566269,0,0.0243284702568,0.00283095932639,0.00102104179321,0.00897890971083,0.00204208358642,0.0122165343967,0,0,0,0,0.00332776801496,0.00649934147293,0.014056198686,0.00317157345797,0,0.00410153185904,0.0647003263302,0.00826993895804,0.00378063632286,0,0.0446638329233,0.000961349923678,0.00186720282586,0.00186260580663,0.0444138756133,0.00340073494863,0.00628002363626,0.0196825372299,0.00198595471452,0.0189300031233,0.0107861573734,0.00416646190614,0.00314850655058,0.0220221369003,0.00639178475723,0.0043743132538,0.00886361588898,0.0110906488313,0.0021871566269
PF00004_AAA 132 DRUGLIKE_ 205 166 0.0114604627231,0.00477140578191,0.00159046859397,0.00573253589269,0.00636187437588,0.0156074891605,0.151915630049,0.00636187437588,0.165530516241,0.00465063703568,0.7665876828,0.782039254129,0.638852101367,0.0148199070736,0.0197556615281,0.0659386849446,0.0233671491129,0.0274992668572,0.0068758651402,0.0243892439834,0.016548822645,0.0165092215714,0.0235691023598,0.0137776796214,0.0108389692647,0.0200299870214,0.0158208194559,0.136698750907,0.0183922951312,0.0108666290365,0.00319897110896,0.00352559062884,0.00340017637127,0.00310366737855,0.00743451388943,0,0,0.00226506315454,0.00226506315454,0.00272300017124,0,0.00159046859397,0.00435666305088,0.00833301017716,0.00674254158319,0.00318093718794,0.00871332610176,0.00871332610176,0.00594713164485,0.00636187437588,0.00477140578191,0.00636187437588,0.00159046859397,0.00159046859397,0,0,0,0,0.0156329952576,0.0124614217996,0.0123052272426,0.00332776801496,0,0,0.0235604250871,0.0205062852754,0.00159046859397,0.00159046859397,0.00443759986205,0.0108022602052,0,0.00758910697789,0.042872742652,0,0,0,0.0205374756636,0,0,0,0.00346838508927,0.00216787556951,0.00390892326059,0.00781784652118,0.00604666604827,0.00768955958345,0.0109017146314,0.012587618788,0.00954249832977,0.0143484218372,0.00116949287805,0.0021871566269,0,0.0243284702568,0.00283095932639,0.00102104179321,0.00897890971083,0.00204208358642,0.0111660663377,0,0,0,0,0.00332776801496,0.00649934147293,0.014056198686,0.00317157345797,0,0.00410153185904,0.0647003263302,0.00826993895804,0.00378063632286,0,0.0446638329233,0.000961349923678,0.00186720282586,0.00186260580663,0.0444138756133,0.00340073494863,0.00628002363626,0.0196825372299,0.00198595471452,0.0189300031233,0.0107861573734,0.00416646190614,0.00314850655058,0.0220221369003,0.00639178475723,0.0043743132538,0.00886361588898,0.0110906488313,0.0021871566269
PF00004_AAA 132 ANP 12 16 0,0,0,0,0,0,0.293054223957,0,0.220890723765,0,0.902585657643,0.902585657643,0.902585657643,0,0.0365564138796,0.149544866261,0.0365564138796,0,0,0,0,0,0,0,0,0,0,0.11424794602,0,0,0,0,0.0359910406715,0.0608579284774,0,0,0,0,0,0,0,0,0,0.0608579284774,0.0608579284774,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0419485401186,0,0,0,0,0,0,0,0.14472176965,0,0,0,0.18127818353,0,0,0,0.0365564138796,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.11424794602,0,0.11424794602,0,0,0,0,0,0,0,0,0,0,0.129010569799,0,0,0,0.112988452381,0,0,0,0.112988452381,0,0,0,0,0,0.0419485401186,0,0,0,0,0,0,0,0
PF00004_AAA 132 PEPTIDE_ 6 9 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.12131242741,0.653927119629,0.188346883469,0,0,0,0,0,0,0,0,0,0,0,0,0,0.251899438637,0.0941734417344,0.0941734417344,0,0,0.251899438637,0.251899438637,0,0,0,0,0,0,0,0,0,0,0,0,0,0.188346883469,0,0,0,0,0.12131242741,0,0.24262485482,0.162299167635,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
PF00005_ABC_tran 137 SM_ 110 141 0.0377234744273,0.00461980790097,0.0096909780396,0.0157905225569,0.0169710021629,0.0275121349589,0.0156602458352,0.00405947701988,0.00739122457456,0.00768805945304,0,0.00161828107474,0.00591327371089,0.00202973850994,0.00579099946437,0.00416338695375,0.00323133939386,0,0.00548167659646,0.303901345366,0,0.280623915296,0.0067656772453,0.679542204351,0.697685738027,0.691214242456,0.00839368055968,0.00482361342933,0.0397878560196,0.0227702896963,0.00909516751323,0.0349315331438,0.0131520331553,0.0193392901203,0.0284420664734,0.0130227255943,0.012984391476,0.0106552933277,0.0204835408505,0.0189168490431,0.0285336657781,0.0275702117864,0,0.00457484438895,0,0,0,0.0238587624559,0.021829023946,0.00254510587901,0.019283918067,0,0.0217485765752,0.0237783150851,0.0213136565769,0.0276168909756,0.0118202358496,0.0114312928629,0.012052883022,0.0100482164063,0.0063542198101,0.00944680557652,0,0.00174289125479,0.0278798911495,0.097924307303,0.00184563007097,0.00635965898894,0.0322596343334,0.0297949758252,0.0490094392875,0.00790848545895,0.0292221420359,0.0249021779027,0,0.0234686984736,0.00517652112411,0.00215504189665,0,0,0.00764743755861,0.00471086684116,0.00909061002502,0.0225204772897,0.00473593873536,0,0.00399443207795,0.00197991580508,0,0,0.00270047617867,0.00251906964526,0.00289475535167,0.00277098703935,0.00270047617867,0.111023129603,0.0217327928995,0.019026506502,0.00270047617867,0.014177092878,0.0679676350364,0.0137430965532,0,0.190965797116,0,0.0585827481125,0.0127265974899,0.0652120185593,0.00794644519468,0.102944250852,0.0137201014491,0.0138465570152,0.0945003452098,0,0.0232318158738,0.00900928101363,0.00996736376179,0,0.0169296493927,0.018744119295,0.0136624119088,0.00214880200698,0.0166390770754,0.0058505131038,0.0142845969412,0.0157676346769,0.0166971239124,0.00307952606445,0,0,0.00264422779836,0.0129966476375,0.0352791479778,0.153867299413,0.00694183181232,0.00489649026725,0.0690208616258
PF00005_ABC_tran 137 METABOLITE_ 93 115 0.0423846282715,0.00244789601093,0.00981946362115,0.017596034204,0.0155837013976,0.0288244918514,0.0149514607241,0.00489579202186,0.00867883255515,0.00935790985627,0,0.00195760342646,0.00193456357756,0.00244789601093,0.00692361173264,0.00509590025931,0.00386912715512,0,0.00645575857075,0.248906074717,0,0.288270775052,0.00801153357221,0.663026833878,0.683136475548,0.681364675625,0.0102468162603,0.00641783224383,0.0346052146251,0.0257845741747,0.0109370965404,0.0417062809958,0.0158098487134,0.0232187100008,0.0311952501979,0.0155601075809,0.0136836831039,0.0109042741005,0.0224747866107,0.0219811610125,0.0265527236302,0.0273664044944,0,0.00558619284378,0,0,0,0.025763062536,0.023315166525,0.00313829683285,0.0201768696922,0,0.0201768696922,0.0226247657031,0.0226247657031,0.0290292426008,0.0110260155152,0.0135234901915,0.0144110449998,0.0119763190905,0.00752124098774,0.0111141050209,0,0.00207540533685,0.0302486042934,0.0524527668327,0.00227251781357,0.00765900406687,0.0354870010588,0.0354870010588,0.0526979917992,0.00665072854151,0.0292754942446,0.0269726888535,0,0.0252313733622,0.00629239146422,0.00260660765911,0,0,0.00886792919754,0.00555046745964,0.010788864225,0.0269824218852,0.00556363756128,0,0.00488547577917,0.00242060730148,0,0,0.00352968911807,0.00386606577791,0.00326618240823,0.00330305433987,0.00352968911807,0.115233025287,0.0130981412706,0.00653821951543,0.00352968911807,0.010863080338,0.0663599175597,0.0052259159312,0,0.184754657886,0,0.0599124302967,0.0162860401874,0.0573467317214,0.00555046745964,0.0728912112544,0.0135766150981,0.00528373197268,0.0739797676168,0,0.0166957068247,0.00427115563956,0.0118994288571,0,0.0185147541652,0.0118994288571,0.0147191112135,0.0026129579656,0.0179293823307,0.00330305433987,0.0154092075878,0.016929152311,0.0142435985256,0,0,0,0.00317755916369,0.0136260979711,0.0335947867361,0.121778722136,0.00840347509489,0.00584833317077,0.0677237376771
PF00005_ABC_tran 137 ANP 17 19 0,0,0.0235131301701,0,0,0.0235131301701,0.0470262603402,0,0,0,0,0,0,0,0,0,0,0,0,0.841006106374,0,0.271660563153,0,0.676289118674,0.676289118674,0.632069683936,0,0,0.0823856755492,0,0,0,0,0,0.031959449247,0,0.0235131301701,0.0235131301701,0.0235131301701,0.0470262603402,0.0940525206804,0.0705393905103,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0470262603402,0,0,0,0,0,0,0,0,0,0.430099299531,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.102133756407,0.102133756407,0,0,0.062410917336,0.0597398340485,0,0.125591324241,0,0,0,0.117715952724,0.0255334391017,0.360214252687,0.0347094361003,0.0597398340485,0.303224405492,0,0.0597398340485,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0442194347374,0.526216121352,0,0,0.229658360646
PF00006_ATP-synt_ab 213 SM_ 48 83 0,0,0,0.00418690801436,0,0,0.0203667663316,0,0.00386900178183,0,0,0.00386900178183,0,0,0,0.00386900178183,0,0,0,0,0,0,0.0988488771734,0.175445150161,0.0695603302947,0.00173400036405,0.57566732122,0.612799884263,0.638785090543,0.0886847595391,0.0538437941832,0.030641262013,0.0880153017934,0.0615367639825,0.0867187752413,0.0612003445172,0.0836536701719,0.0573313427354,0.0417983345638,0.0133194461894,0.0230254503209,0.019351851181,0.00386900178183,0.0339285033528,0.0242224992213,0.0121112496106,0,0,0,0,0.0739242998399,0.146139675162,0,0.00522074083795,0.0299399626262,0,0.0247466468246,0.0331204628534,0,0.00969600582193,0.00969600582193,0.0022941917051,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.00386900178183,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0429700305796,0.0325605349484,0.0325605349484,0,0,0,0.0247466468246,0.0247466468246,0,0,0.0163728307959,0,0.00297268548145,0,0,0,0,0,0,0,0,0,0.00787881465062,0.00393940732531,0,0,0,0,0,0,0,0,0,0,0,0.00297268548145,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0459784485426,0,0,0.00173400036405,0,0,0,0,0.0125607240431,0.017819954343,0,0.0125607240431,0.0121859227816,0,0,0,0.0626004689813,0,0,0.00173400036405,0.0043609421676,0,0,0,0,0.0127797902401,0,0,0.0389426376458,0.199928757533,0.0125562590955,0.0022941917051,0.0117102485569,0.0177066053529,0.475712856261,0.00868954844585,0.014584989095,0.025337533588,0.0970702786178,0.0297709240626,0.045010602041,0.0189803777073,0.0471133143059,0.114051762747,0.0518791397563

You can see that the binding_frequencies column contains per residue scores from 0 to 1 that say how likely each residue is to be involved in binding. Each value is comma separated and there are as many values as there are match states in the HMM. That means the number of values in binding_frequencies matches the value of domain_length. Also notice that the ligand_type varies and you can have multiple ligands for a given Pfam. I hope its appreciated that these files give the essential output of the InteracDome workflow, and there are two such files. The first is the non-redundant representable set, which “correspond to domain-ligand interactions that had nonredundant instances across three or more distinct PDB structures. [Kobren and Singh] recommend using this collection to learn more about domain binding properties”. If I recall, there are 2375 such Pfam IDs in this set (this is compared to the roughly 15,000-20,000 total Pfam IDs). The second set is the confident set, which “correspond to domain-ligand interactions that had nonredundant instances across three or more distinct PDB entries and achieved a cross-validated precision of at least 0.5. [Kobren and Singh] recommend using this collection to annotate potential ligand-binding positions in protein sequences”.

Description of the HMM profiles

Since each entry in the InteracDome dataset corresponds to a Pfam, we just need the Pfam database, and that can be readily downloaded from the EBI ftp server, or by running anvi-setup-pfams. But since all the Pfams are not present in the InteracDome dataset, we merely need a subset of the Pfam database that contains Pfams with at least one entry from the non-redundant representable InteracDome set. From now on I’ll just call this subset the InteracDome Pfams (IPfams). Also, it needs to be said that InteracDome was carried out using Pfam 31.0, so it is critical that is what we use too.

anvi-setup-interacdome

Acquiring the InteracDome’s tab-separated files and the IPfam HMM profiles are just one-time setups, and therefore I created a program called anvi-setup-interacdome to handle these tasks. So to make these inputs available, the user must run anvi-setup-interacdome which initiates a class called InteracDomeSetup. Here is the class:

This is merely a snapshot of the class as of July 12th, 2020.

class InteracDomeSetup(object):
    def __init__(self, args, run=terminal.Run(), progress=terminal.Progress()):
        """Setup a Pfam database for anvi'o

        Parameters
        ==========
        args : argparse.Namespace
            See `bin/anvi-setup-interacdome` for available arguments
        """

        self.run = run
        self.progress = progress
        self.interacdome_data_dir = args.interacdome_data_dir

        self.pfam_setup = None

        self.interacdome_files = {
            'representable_interactions.txt': 'https://interacdome.princeton.edu/session/344807c477180230032a2a3807ba47c3/download/downloadBP?w=',
            'confident_interactions.txt': 'https://interacdome.princeton.edu/session/344807c477180230032a2a3807ba47c3/download/downloadConfidentBP?w=',
        }

        if self.interacdome_data_dir and args.reset:
            raise ConfigError("You are attempting to run InteracDome setup on a non-default data directory (%s) using the --reset flag. "
                              "To avoid automatically deleting a directory that may be important to you, anvi'o refuses to reset "
                              "directories that have been specified with --interacdome-data-dir. If you really want to get rid of this "
                              "directory and regenerate it with InteracDome data inside, then please remove the directory yourself using "
                              "a command like `rm -r %s`. We are sorry to make you go through this extra trouble, but it really is "
                              "the safest way to handle things." % (self.interacdome_data_dir, self.interacdome_data_dir))

        if not self.interacdome_data_dir:
            self.interacdome_data_dir = constants.default_interacdome_data_path

        self.run.warning('', header='Setting up InteracDome', lc='yellow')
        self.run.info('Data directory', self.interacdome_data_dir)
        self.run.info('Reset contents', args.reset)

        filesnpaths.is_output_dir_writable(os.path.dirname(os.path.abspath(self.interacdome_data_dir)))

        if not args.reset and not anvio.DEBUG:
            self.is_database_exists()

        filesnpaths.gen_output_directory(self.interacdome_data_dir, delete_if_exists=args.reset)


    def is_database_exists(self):
        """Raise ConfigError if database exists

        Currently, this primitively decides if the database exists by looking at whether Pfam-A.hmm
        or Pfam-A.hmm.gz exists.
        """

        if (os.path.exists(os.path.join(self.interacdome_data_dir, 'Pfam-A.hmm') or 
            os.path.exists(os.path.join(self.interacdome_data_dir, 'Pfam-A.hmm.gz')))):
            raise ConfigError("It seems you already have the InteracDome data downloaded in '%s', please "
                              "use --reset flag if you want to re-download it." % self.interacdome_data_dir)


    def download_interacdome_files(self):
        """Download the confident and representable non-redundant InteracDome datasets

        These datasets can be found at the interacdome webpage: https://interacdome.princeton.edu/
        """

        for path, url in self.interacdome_files.items():
            utils.download_file(
                url,
                os.path.join(self.interacdome_data_dir, path),
                check_certificate=False,
                progress=self.progress,
                run=self.run
            )


    def download_pfam(self):
        """Setup the pfam data subset used by interacdome

        Currently, interacdome only works for pfam version 31.0, so that is the version downloaded here.
        """

        pfam_args = argparse.Namespace(
            pfam_data_dir=self.interacdome_data_dir,
            pfam_version='31.0',
            reset=False,
        )

        self.pfam_setup = pfam.PfamSetup(pfam_args)
        self.pfam_setup.get_remote_version()
        self.pfam_setup.download(hmmpress_files=False)


    def load_interacdome(self, kind='representable'):
        """Loads the representable interacdome dataset as pandas df"""

        data = InteracDomeTableData(kind=kind, interacdome_data_dir=self.interacdome_data_dir)
        return data.get_as_dataframe()


    def get_interacdome_pfam_accessions(self):
        """Get the representable interacdome Pfam accessions"""

        return set(self.load_interacdome(kind='representable').index.tolist())


    def filter_pfam(self):
        """Filter Pfam data according to whether the ACC is in the InteracDome dataset"""

        interacdome_pfam_accessions = self.get_interacdome_pfam_accessions()

        hmm_profiles = pfam.HMMProfile(os.path.join(self.interacdome_data_dir, 'Pfam-A.hmm'))
        hmm_profiles.filter(by='ACC', subset=interacdome_pfam_accessions)
        hmm_profiles.write(filepath=None) # overwrites

        # hmmpresses the new .hmm
        self.pfam_setup.hmmpress_files()

        # We also filter out the Pfam-A.clans.tsv file, since it is used as a catalog
        clans_file = os.path.join(self.interacdome_data_dir, 'Pfam-A.clans.tsv')
        clans = pd.read_csv(clans_file, sep='\t', header=None)
        clans[clans[0].isin(interacdome_pfam_accessions)].to_csv(clans_file, sep='\t', index=False, header=False)


    def setup(self):
        """The main method of this class. Sets up the InteracDome data directory for usage"""

        self.run.warning('', header='Downloading InteracDome tables', lc='yellow')
        self.download_interacdome_files()

        self.run.warning('', header='Downloading associated Pfam HMM profiles', lc='yellow')
        self.download_pfam()

        self.run.warning('', header='Filtering Pfam HMM profiles', lc='yellow')
        self.filter_pfam()

First, anvi’o creates a directory under anvio/data/misc/InteracDome (by default) which will end up housing the IPfam HMM profiles and the InteracDome datasets. Then the main method, setup is called, which is directly above us. We can see that first, anvi’o downloads the tab-separated files. Since these files were created for Pfam version 31.0, anvi’o next downloads a copy of the Pfam 31.0 HMMs. There already exists a framework to download Pfams because of anvi-setup-pfams, so this was very little work to do. Yay for code resusability. As a final step, the Pfam HMMs are subset to the IPfam HMMs, i.e. only the Pfams that are in the InteracDome dataset.

(3) Running the hidden Markov model (HMM)

HMMs are by no means an elementary topic, and so rather than butcher an explanation with my limited understanding, I defer to this wonderful paper.

With both the required inputs being gathered and setup with anvi-setup-interacdome, the next thing I focused on was actually running an HMM on user genes. This as well as the remainder of steps will be carried out by the program anvi-run-interacdome.

The program that goes hand-in-hand with Pfam HMM profiles is HMMER, and in fact, anvi’o already has a driver to run HMMER on a set of user genes which was written for anvi-run-pfams. Further in fact(?), anvi-run-pfams in some ways does exactly what I want: it takes genes from the contigs database, runs an HMM of the user genes against the Pfam HMM profiles, and annotates genes with the best Pfam hit. The primary difference between anvi-run-interacdome and anvi-run-pfams, is that rather than simply attribute to each gene the Pfam ID of the best hit, we instead must keep track of residue-level information associated to each hit, so that we can associate the binding frequencies from the InteracDome dataset. In this sense, things are much more complicated, as I will need to write an in-depth parser of HMMER’s output to keep track of these things.

Currently, anvi-run-pfams can be run with either of the HMMER programs, hmmscan or hmmsearch. However, for good reason, hmmsearch in practice runs much faster for querying user genes against a HMM profile database with a size comparable to the Pfam database. Since hmmsearch is faster for our use-case and because I wanted to avoid creating 2 output parsers (one for hmmscan and one for hmmsearch), anvi-run-interacdome is by design only compatible with hmmsearch.

All in all, anvi-run-interacdome runs hmmsearch almost exactly like anvi-run-pfams does. In fact, the responsible class for anvi-run-interacdome, anvio.interacdome.InteracDomeSuper, inherits anvio.pfam.Pfam to manage a lot of the boiler-plate code.

I chose to apply the HMMER filter parameter --cut_ga that uses the Pfam GA (gathering threshold) score cutoffs to pre-filter the output of hmmsearch. This cuts out a lot of crap to sift through.

(4) Parsing HMMER’s standard output file

hmmsearch has 2 outputs. One is a tabular output like this:

#                                                                            --- full sequence --- -------------- this domain -------------   hmm coord   ali coord   env coord
# target name        accession   tlen query name           accession   qlen   E-value  score  bias   #  of  c-Evalue  i-Evalue  score  bias  from    to  from    to  from    to  acc description of target
#------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------
3609                 -            333 2-Hacid_dh           PF00389.29   134   8.8e-20   68.1   0.1   1   1   8.7e-23   1.2e-19   67.7   0.1     9   133    20   325     8   326 0.95 -
5374                 -            320 2-Hacid_dh           PF00389.29   134     2e-10   37.8   0.0   1   1   2.6e-13   3.6e-10   37.0   0.0    21   103    24   195     5   310 0.71 -
3609                 -            333 2-Hacid_dh_C         PF02826.18   178     1e-55  185.2   0.0   1   1   1.6e-58   1.5e-55  184.7   0.0     2   177   118   293   117   294 0.98 -
5374                 -            320 2-Hacid_dh_C         PF02826.18   178   1.7e-53  178.0   0.1   1   1   3.2e-56   2.9e-53  177.2   0.1     2   178   108   282   107   282 0.96 -
3608                 -            311 2-Hacid_dh_C         PF02826.18   178   2.4e-07   27.7   0.0   1   1   4.4e-10     4e-07   26.9   0.0    36   144    12   123     5   130 0.85 -
2301                 -            162 2Fe-2S_thioredx      PF01257.18   145   3.8e-37  124.7   0.2   1   1   1.6e-40   4.3e-37  124.5   0.2     4   144    12   154     9   155 0.95 -
2121                 -            441 2HCT                 PF03390.14   416  8.5e-157  519.4  35.0   1   1  3.6e-160  9.9e-157  519.2  35.0     1   415    24   434    24   435 0.99 -
1998                 -            329 3Beta_HSD            PF01073.18   282   3.1e-22   76.2   0.0   1   1   1.1e-23   3.9e-21   72.6   0.0     1   237     4   243     4   254 0.74 -
504                  -            316 3Beta_HSD            PF01073.18   282   7.7e-18   61.8   0.1   1   1   3.4e-20   1.2e-17   61.2   0.1     1   229     5   232     5   242 0.78 -
[...]

The format is tabularized and thus easy enough to parse, but unfortunately the information is not detailed enough since it contains no information about residue-by-residue alignment of the match states (residues in the HMM profile) to the user gene sequence. The second, more verbose output, does:

# hmmsearch :: search profile(s) against a sequence database
# HMMER 3.2.1 (June 2018); http://hmmer.org/
# Copyright (C) 2018 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# query HMM file:                  /Users/evan/Software/anvio/anvio/data/misc/InteracDome/Pfam-A.hmm
# target sequence database:        /var/folders/58/mpjnklbs5ql_y2rsgn0cwwnh0000gn/T/tmpyebp3gbo/AA_gene_sequences.fa.0
# output directed to file:         /var/folders/58/mpjnklbs5ql_y2rsgn0cwwnh0000gn/T/tmpyebp3gbo/AA_gene_sequences.fa.0_output
# per-dom hits tabular output:     /var/folders/58/mpjnklbs5ql_y2rsgn0cwwnh0000gn/T/tmpyebp3gbo/AA_gene_sequences.fa.0_table
# model-specific thresholding:     GA cutoffs
# number of worker threads:        1
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Query:       14-3-3  [M=222]
Accession:   PF00244.19
Description: 14-3-3 protein
Scores for complete sequences (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Sequence Description
    ------- ------ -----    ------- ------ -----   ---- --  -------- -----------

   [No hits detected that satisfy reporting thresholds]


Domain annotation for each sequence (and alignments):

   [No targets detected that satisfy reporting thresholds]


Internal pipeline statistics summary:
-------------------------------------
Query model(s):                            1  (222 nodes)
Target sequences:                       2754  (847468 residues searched)
Passed MSV filter:                       207  (0.0751634); expected 55.1 (0.02)
Passed bias filter:                       80  (0.0290487); expected 55.1 (0.02)
Passed Vit filter:                         6  (0.00217865); expected 2.8 (0.001)
Passed Fwd filter:                         0  (0); expected 0.0 (1e-05)
Initial search space (Z):               2754  [actual number of targets]
Domain search space  (domZ):               0  [number of targets reported over threshold]
# CPU time: 0.01u 0.00s 00:00:00.01 Elapsed: 00:00:00.01
# Mc/sec: 14884.77
//
Query:       2-Hacid_dh  [M=134]
Accession:   PF00389.29
Description: D-isomer specific 2-hydroxyacid dehydrogenase, catalytic domain
Scores for complete sequences (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Sequence Description
    ------- ------ -----    ------- ------ -----   ---- --  -------- -----------
    8.8e-20   68.1   0.1    1.2e-19   67.7   0.1    1.1  1  3609
      2e-10   37.8   0.0    3.6e-10   37.0   0.0    1.5  1  5374


Domain annotation for each sequence (and alignments):
>> 3609
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
 ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----
   1 !   67.7   0.1   8.7e-23   1.2e-19       9     133 ..      20     325 ..       8     326 .. 0.95

  Alignments for each domain:
  == domain 1  score: 67.7 bits;  conditional E-value: 8.7e-23
                 HHHHHHHHH.TTE.EEEEESCSSHHHCHHHHTTESEEEE-TTS-BSHHHHHHHTT--EEEESSSSCTTB-HHHHHHTT-EEEE-TT.TTHHHHHHHH.. CS
                 xxxxxxxxx.xxx.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.xxxxxxxxxx.. RF
  2-Hacid_dh   9 eeelellke.egv.evevkdellteellekakdadalivrsntkvtaevlealpkLkviaragvGvDnvDldaakerGilVtnvpg.ystesvAElt.. 102
                 +e l +l++ ++v +++v +e+  +el+e ++++ ++i  ++  +t+e++e+ ++L +i+r+g+G++n+Dldaak++  +V+ +p+ +  ++vAE
        3609  20 PEHLTRLEKiGTVkHFTVDSEIGGKELAECLQGYTIIIASVTPFFTKEFFEHKDELLLISRHGIGYNNIDLDAAKQHDTIVSIIPAlVERDAVAENNvt 118
                 667777888655556777779*****************************************************************777889***999* PP

                 ................................................................................................... CS
                 ................................................................................................... RF
  2-Hacid_dh 103 ................................................................................................... 102

        3609 119 nllavlrqtvaadasvkadqwekranfvgrtlfnktvgvigvgntgscvvetlrngfrcdvlaydpyksatylqsygakkvdldtllasadiiclcanl 217
                 *************************************************************************************************** PP

                 .............................................................................T-BHHHHHHHHHHHHHHHHHHH CS
                 .............................................................................xxxxxxxxxxxxxxxxxxxxxx RF
  2-Hacid_dh 103 .............................................................................aaTeeaqeniaeeaaenlvafl 124
                                                                                              a+T e +++++e++++ +++++
        3609 218 teesyhmigsaeiakmkdgvylsnsargalideeamiaglqsgkiaglgtdvleeepgrknhpylafenvvmtphtsAYTMECLQAMGEKCVQDVEDVV 316
                 *************************************************************************************************** PP

                 TTCCGTTBC CS
                 xxxxxxxxx RF
  2-Hacid_dh 125 kgespanav 133
                 +g  p+  v
        3609 317 QGILPQRTV 325
                 **7777655 PP

>> 5374
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
 ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----
   1 !   37.0   0.0   2.6e-13   3.6e-10      21     103 ..      24     195 ..       5     310 .. 0.71

  Alignments for each domain:
  == domain 1  score: 37.0 bits;  conditional E-value: 2.6e-13
                 EEEEESCSSHHHCHHHHTT.ESEEEE-TTS-BSHHHHHHH.TT--EEEESSSSCTTB-HHHHHHTT-EEEE-TTTTHHHHHHHH............... CS
                 xxxxxxxxxxxxxxxxxxx.xxxxxxxxxxxxxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx............... RF
  2-Hacid_dh  21 evevkdellteellekakd.adalivrsntkvtaevleal.pkLkviaragvGvDnvDldaakerGilVtnvpgystesvAElt............... 102
                   ++  + +t+ l ++ ++ +++++ +  +++  ++l+   + Lk+i+++++G D  D+d ++e+Gil++n  g ++ s+ E++
        5374  24 APNYLVKTSTDYLSSAEEEaIEIMLGWH-KEIGPRLLASDtSHLKWIQLISAGADYMDFDKLREKGILLSNGSGIHSVSISEHVlgvllahtrglqesi 121
                 4444446666655555555344555555.559999998888************************************************9999888888 PP

                 .........................................................................T CS
                 .........................................................................x RF
  2-Hacid_dh 103 .........................................................................a 103

        5374 122 qqqmqhtwnqtapsyqqlsgqkmlivgtgqigqqlakfakglnlqvygvntsghvtegfiecysqknmskiihE 195
                 88888888888888888888888888888888888888888888888887777777775554444444444440 PP

[...]

To write the parser I made extensive use of the detailed output description in the HMMER user guide (search for “Step 2: search the sequence database with hmmsearch”) for the relevant section. I won’t reiterate the meaning of each line, so check the guide if you’re keen. The most important thing to note is that this more verbose output contains the same tabular information as the first output (albeit in more scattered fashion), and additionally has alignments of the HMM profile to the query gene.

Consider this section of the output:

Query:       2-Hacid_dh  [M=134]
Accession:   PF00389.29
Description: D-isomer specific 2-hydroxyacid dehydrogenase, catalytic domain
Scores for complete sequences (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Sequence Description
    ------- ------ -----    ------- ------ -----   ---- --  -------- -----------
    8.8e-20   68.1   0.1    1.2e-19   67.7   0.1    1.1  1  3609
      2e-10   37.8   0.0    3.6e-10   37.0   0.0    1.5  1  5374

I will refer to this as the domain hits summary. In this case, the IPfam PF00389.29 hit to 2 user genes: 3609 and 5374. Then, for each of these genes, we can find further down in the file the alignment of each. For example, here is where the alignment info for gene 3609 starts:

>> 3609
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
 ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----
   1 !   67.7   0.1   8.7e-23   1.2e-19       9     133 ..      20     325 ..       8     326 .. 0.95

  Alignments for each domain:
  == domain 1  score: 67.7 bits;  conditional E-value: 8.7e-23
                 HHHHHHHHH.TTE.EEEEESCSSHHHCHHHHTTESEEEE-TTS-BSHHHHHHHTT--EEEESSSSCTTB-HHHHHHTT-EEEE-TT.TTHHHHHHHH.. CS
                 xxxxxxxxx.xxx.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.xxxxxxxxxx.. RF
  2-Hacid_dh   9 eeelellke.egv.evevkdellteellekakdadalivrsntkvtaevlealpkLkviaragvGvDnvDldaakerGilVtnvpg.ystesvAElt.. 102
                 +e l +l++ ++v +++v +e+  +el+e ++++ ++i  ++  +t+e++e+ ++L +i+r+g+G++n+Dldaak++  +V+ +p+ +  ++vAE
        3609  20 PEHLTRLEKiGTVkHFTVDSEIGGKELAECLQGYTIIIASVTPFFTKEFFEHKDELLLISRHGIGYNNIDLDAAKQHDTIVSIIPAlVERDAVAENNvt 118
                 667777888655556777779*****************************************************************777889***999* PP

[...]

I will refer to this as the hit alignment. The aggregation of all the domain hit summaries and the hit alignments are stored in two datastructures that are attributes of the anvio.parsers.hmmer.HMMERStandardOutput parser class. I will now describe both of them:

The first is dom_hits, which is a dataframe which looks like this:

  pfam_name pfam_id corresponding_gene_call domain qual score bias c-evalue i-evalue hmm_start hmm_stop hmm_bounds ali_start ali_stop ali_bounds env_start env_stop env_bounds mean_post_prob match_state_align comparison_align sequence_align version
0 Beta_elim_lyase PF01212 1762 1 ! 20.9 0.1 1e-08 3.5e-06 33 169 .. 44 177 .. 34 215 .. 0.72 tvnrLedavaelfgke..aalfvpqGtaAnsill.kill.qr..geevivtepahihfdetgaiaelagvklrdlknkeaGkmdlekleaaikevgaheekiklisltvTnntagGqvvsleelrevaaiakkygiplhlDgA ++ +++ael+ + f+ Gt +++ l + + +r g+ +i++ h +et + g +l ++ +++G +++e+l+++i++ e i + +++v n+ G++ +++e+ ev +a+ +i++h+D+ LLQQARKQIAELINVSanEIYFTSGGTEGDNWVLkGTAIeKRefGNHIIISAVEHPAVTETAEQLVELGFELSYAPVDKEGRVKVEELQKLIRK—–ETILVSVMAVNNE–VGTIQPIKEISEV–LAEFPKIHFHVDAV 20
1 PAPS_reduct PF01507 1541 1 ! 36.1 0.1 3.6e-13 1.3e-10 2 164 .. 21 231 .. 20 234 .. 0.79 lvvsvsgGkdslVllhLalkafkpv….pvvfvdtghefpetiefvdeleeryglrlkvyepeeevaekinaekhgs.slyee.aaeriaKveplkk……………………………aLekldedall..tGaRrdesksraklpiveidedfek………slrvfPllnWteedvwqyilrenipynpLydqgfr + +s+sgGkds +++La + ++ ++ ++ + ++ t++f++++e+ +++ +++ ++++ + + +++ + + + e+ + p k e++ ++a+ +G+R++es +r++ +++ +++ + ++Pl++W+ d+w+ + +++yn +y++ ++ VYFSFSGGKDSGLMVQLANLVAEKLdrnfDLLILNIEANYTATVDFIKKIEQLPRVKNIYHFCLPFFEDNNTSFFQPQwKMWDPsEKEKWIHSLP–KnaitleniddglkkyyslsngnpdrflryfqnwYKEQYPQSAIScgVGIRAQESLHRHSAVTKGENKYKNRcwinitlegNILFYPLFDWKVGDIWAATFKCELEYNYIYEKMYK 18
2 Ank_2 PF12796 1756 1 ! 32.2 0 6.7e-12 2.3e-09 29 84 .] 74 135 .. 53 135 .. 0.85 aLhyAakngnleivklLle…h.a..adndgrtpLhyAarsghleivklLlekgadinlkd aL Aa + +++ vk +l+ + + +d +g+tpL +A+ ++ +ei+k L+++gadinl++ ALLEAANQRDTKKVKEILQdttYqVdeVDTEGNTPLNIAVHNNDIEIAKALIDRGADINLQN 6
3 Ank_2 PF12796 1756 2 ! 28.5 0 9.5e-11 3.3e-08 22 75 .. 199 265 .. 195 267 .] 0.76 pn..k.ngktaLhyAak..ngnl…eivklLleha…..adndgrtpLhyAarsghleivklLle ++ + +g taL+ A+ +gn +ivklL+e++ dn+grt++ yA ++g++ei k+L + IDfqNdFGYTALIEAVGlrEGNQlyqDIVKLLMENGadqsiKDNSGRTAMDYANQKGYTEISKILAQ 6
4 IGPS PF00218 1615 1 ! 20.6 0.1 1.2e-08 4e-06 202 249 .. 195 242 .. 73 248 .. 0.88 LaklvpkdvllvaeSGiktredveklkeegvnafLvGeslmrqedvek +++lv+++++++ae i+t+e+++++k+ gv ++ vG +++r ++ +k IKQLVQENICVIAEGKIHTPEQARQIKKLGVAGIVVGGAITRPQEIAK 20
5 Ribosomal_L33 PF00471 1562 1 ! 66.6 1.5 1.1e-22 3.7e-20 2 47 .] 4 49 .] 3 49 .] 0.97 kvtLeCteCksrnYtttknkrntperLelkKYcprcrkhtlhkEtK +++LeC e+++r Y t+knkrn+perLelkKY p++r++ ++kE K NIILECVETGERLYLTSKNKRNNPERLELKKYSPKLRRRAIFKEVK 19
6 Ribosomal_S14 PF00253 1565 1 ! 83.3 0.1 3.9e-28 1.3e-25 2 54 .] 36 88 .. 35 88 .. 0.98 laklprnssptrirnrCrvtGrprGvirkfgLsRicfRelAlkgelpGvkKaS laklpr+s+p+r+r r++ +GrprG++rkfg+sRi+fRel ++g +pGvkKaS LAKLPRDSNPNRLRLRDQTDGRPRGYMRKFGMSRIKFRELDHQGLIPGVKKAS 20
7 Polysacc_synt_C PF14667 1593 1 ! 61.4 19.2 5.4e-21 1.9e-18 2 139 .. 371 516 .. 370 519 .. 0.83 LailalsiiflslstvlssiLqglgrqkialkalvigalvklilnllliplfgivGaaiatvlallvvavlnlyalrrllgikl…llrrllkpllaalvmgivvylllllllglllla…al..alllavlvgalvYllllll L+ ++s+ +l+++t++ siLq+l +k+a+ ++ i++l+kli+++++i+lf +G +iat+++ ++++++ +++l+r++ i++ ++ +++ +++vm i+ +l+l+++ ++ + +l + l +++g++v+ + l++ LSATIISTSLLGIFTIVLSILQALSFHKKAMQITSITLLLKLIIQIPCIYLFKGYGLSIATIICTMFTTIIAYRFLSRKFDINPikyNRKYYSRLVYSTIVMTILSLLMLKIISSVYKFEstlQLffLISLIGCLGGVVFSVTLFR 5

It stores all the domain hit information in the domain hit summaries as well as the sequence of the consensus match states, the comparison string, and the sequence of the user gene. This table contains all essential information, however the alignment information is in the form of raw strings (see match_state_align, comparison_align, and sequence_align columns). To further process the alignment info into a more meaningful form, there is a second data structure called ali_info. ali_info is a nested dictionary. It looks like this:

ali_info = {
    # here is the template
    <gene_callers_id>: {
        (<pfam_id>, <domain_id>): {
            <a_dataframe_with_alignment_info>
        },
    },
    # here is an example
    1699: {
        ('PF07972', 1): {
            <a_dataframe_with_alignment_info>
        },
        ('PF07972', 2): {
            <another_dataframe_with_alignment_info>
        },
        ('PF03383', 1): {
            <yet_another_dataframe_with_alignment_info>
        },
    },
}

Each dataframe contains detailed alignment info for a given hit. In the above example, if we wanted to access alignment info about the hit of PF00389.29 against the user gene with ID 3609, we would access the dataframe via ali_info[3609][('PF00389', 1)]. Since PF00389 hit only once to 3609, the hit has the domain ID 1. This entry in the nested dictionary is a dataframe that looks like this:

  seq hmm comparison seq_positions hmm_positions
0 P E + 19 8
1 E E E 20 9
2 H E   21 10
3 L L L 22 11
4 T E   23 12
5 R L + 24 13
6 L L L 25 14
7 E K + 26 15

This dataframe is a per-residue characterization of the alignment strings, i.e. the columns match_state_align, sequence_align, and comparison_align found in dom_hits. For convenience, positions in the alignment that contain a gap either in the HMM or in the gene sequence are stripped from this dataset. seq_positions correspond to the 0-indexed positions in the user gene (in this example the gene is 3609). In other words, these are the codon_order_in_gene values that aligned to a non-gap position in the HMM. Why doesn’t seq_positions start at 0? Because in this example, the hit starts at the 20th amino acid in the user gene. Sanity check: this means the ali_start entry in the corresponding row of self.dom_hits would be 20.

Together, these 2 data structures are the fruits of labor from parsing the HMMER standard out and provide generic utility that extends beyond just anvi-run-interacdome.

(5) Filtering HMM hits

Pfam gathering threshold

As previously mentioned, hmmsearch is ran with the --cut_ga flag, which uses Pfam gathering thresholds to pre-filter a lot of poor-quality hits.

Filtering partial hits

In the InteracDome paper, Kobren and Singh are more stringent about filtering hits than I decided to be. In the paper, it is demanded that 100% of the Pfam HMM maps to the user gene, i.e. the first and last position of the HMM hit should be non-gap characters, or else the hit is discarded. In other words, they discarded all partial hits. I have opted to relax this constraint since it seemed too strict for my applications, however, I did want to enforce some kind of threshold that would discard hits that are very partial. To quantify the partialness of a hit I defined the hit fraction quantity, which is defined as

where $K$ is the alignment length (hmm_stop - hmm_start) and $L$ is the length of the HMM profile (the number of match states). For example, in the hit below, PF01212 has length of $293$ (not shown below, but trust me), and the alignment length is $169-33=136$. This means the hit fraction is $f = 136/293 \approx 0.46$.

  pfam_name pfam_id corresponding_gene_call domain qual score bias c-evalue i-evalue hmm_start hmm_stop hmm_bounds ali_start ali_stop ali_bounds env_start env_stop env_bounds mean_post_prob match_state_align comparison_align sequence_align version
0 Beta_elim_lyase PF01212 1762 1 ! 20.9 0.1 1e-08 3.5e-06 33 169 .. 44 177 .. 34 215 .. 0.72 tvnrLedavaelfgke..aalfvpqGtaAnsill.kill.qr..geevivtepahihfdetgaiaelagvklrdlknkeaGkmdlekleaaikevgaheekiklisltvTnntagGqvvsleelrevaaiakkygiplhlDgA ++ +++ael+ + f+ Gt +++ l + + +r g+ +i++ h +et + g +l ++ +++G +++e+l+++i++ e i + +++v n+ G++ +++e+ ev +a+ +i++h+D+ LLQQARKQIAELINVSanEIYFTSGGTEGDNWVLkGTAIeKRefGNHIIISAVEHPAVTETAEQLVELGFELSYAPVDKEGRVKVEELQKLIRK—–ETILVSVMAVNNE–VGTIQPIKEISEV–LAEFPKIHFHVDAV 20

To apply a filter based on partialness, I created a threshold value, self.min_hit_threshold (specified with --min-hit-fraction) that the hit fraction of a hit must exceed in order to be kept. If self.min_hit_threshold = 0.5, then the above hit would be removed since its hit fraction is less than this value ($\approx 0.46$). To determine a reasonable default value for this threshold, I calculated the hit fractions for all hits of an E. faecalis genome against the IPfams which yielded 2898 hits. Here is the resulting histogram of hit fractions:

hit-fraction-hist

From this histogram, I deemed 0.8 was a good compromise between data retention and quality, so that is the default on July 21st, 2020.

Filtering bad hits with information content

Really, filtering bad hits is what the Pfam gathering threshold is designed for. But to be extra sure that we are not including junk, I followed the protocol of the InteracDome paper. To understand what Kobren and Singh did, we must first make a one paragraph detour to define some terms.

An HMM profile is composed of match states, one for each residue position in the HMM profile. And each match state is composed of 20 emission probabilities, one for each amino acid. An emission probability gives a probability that a given amino acid is observed for the match state. So match states which are highly conserved will have a single dominant emission probability, whereas an unconserved match state will have more uniformly distributed emission probabilities, meaning many amino acids can inhabit the match state. To quantify how conserved a match state is, there is a quantity used in sequence logos called information content (IC), which compares the entropy of a perfectly uniform distribution of emission probabilities to the observed entropy of emission probabilities. The formula for IC boils down to:

where $f_i$ is the emission probability of the $i$th amino acid. The equation’s worth boils down to this: the higher the information content, the more conserved the match state is.

With this in mind, Kobren and Singh identified each match state in a hit that had an IC exceeding 4 (very conserved) and took note of the consensus amino acid (the amino acid with the highest emission probability). Then, in order to retain a hit, the gene sequence this HMM profile hit to must share all the same amino acids as the consensus amino acids at each of these conserved positions. The idea is that if one is to trust the quality of the hit, these positions should equate since they are conserved. I wanted to replicate this filtering procedure.

This is great and all, except the HMMER output does not provide emission probabilities, and therefore calculating IC is a non-trivial task…

Calculating information content

To calculate information content (IC), one must dive into the .hmm file itself (the one created during anvi-setup-interacdome). An .hmm file contains all the information necessary about each HMM profile. Here is the content for just one HMM profile in the .hmm:

HMMER3/f [3.1b2 | February 2015]
NAME  SnoaL_2
ACC   PF12680.6
DESC  SnoaL-like domain
LENG  102
ALPH  amino
RF    no
MM    no
CONS  yes
CS    yes
MAP   yes
DATE  Fri Jan 20 17:09:36 2017
NSEQ  346
EFFN  15.065170
CKSUM 2037425494
GA    27.00 27.00;
TC    27.00 27.00;
NC    26.90 26.90;
BM    hmmbuild HMM.ann SEED.ann
SM    hmmsearch -Z 26740544 -E 1000 --cpu 4 HMM pfamseq
STATS LOCAL MSV       -9.2504  0.71708
STATS LOCAL VITERBI  -10.2672  0.71708
STATS LOCAL FORWARD   -3.5453  0.71708
HMM          A        C        D        E        F        G        H        I        K        L        M        N        P        Q        R        S        T        V        W        Y   
            m->m     m->i     m->d     i->m     i->i     d->m     d->d
  COMPO   2.34288  4.65458  2.72525  2.62519  3.03686  2.67067  3.52524  3.01724  3.00945  2.64165  3.76982  3.29469  3.35980  3.23149  2.70953  2.90053  2.80605  2.55611  4.08669  3.52142
          2.68618  4.42225  2.77519  2.73123  3.46354  2.40513  3.72494  3.29354  2.67741  2.69355  4.24690  2.90347  2.73739  3.18146  2.89801  2.37887  2.77519  2.98518  4.58477  3.61503
          0.05491  6.34906  2.96263  0.61958  0.77255  0.00000        *
      1   1.85458  4.97177  5.25772  5.11150  3.08317  4.68531  4.99660  2.29317  4.89333  2.21381  3.63812  4.70077  5.28514  5.02200  4.89677  3.94172  3.72771  0.88269  4.66437  3.70054      1 v - - H
          2.68618  4.42225  2.77519  2.73123  3.46354  2.40513  3.72494  3.29354  2.67741  2.69355  4.24690  2.90347  2.73739  3.18146  2.89801  2.37887  2.77519  2.98518  4.58477  3.61503
          0.00274  6.29689  7.01924  0.61958  0.77255  0.61121  0.78240
      2   2.68388  5.69592  2.79379  1.70132  5.25800  3.51515  3.94249  4.36550  2.81371  3.15390  5.18636  3.57935  4.80095  2.41963  1.52011  3.04363  2.67472  3.24049  6.57819  4.76611      2 r - - H
          2.68618  4.42225  2.77519  2.73123  3.46354  2.40513  3.72494  3.29354  2.67741  2.69355  4.24690  2.90347  2.73739  3.18146  2.89801  2.37887  2.77519  2.98518  4.58477  3.61503
          0.00268  6.31868  7.04103  0.61958  0.77255  0.65639  0.73131
      3   1.69557  6.15246  2.61454  2.25344  4.80524  3.13834  3.50575  4.98574  2.35803  3.99893  4.81684  3.13184  4.80553  2.73142  1.63172  3.26000  2.93426  4.54380  6.58436  4.54785      3 r - - H
          2.68618  4.42225  2.77519  2.73123  3.46354  2.40513  3.72494  3.29354  2.67741  2.69355  4.24690  2.90347  2.73739  3.18146  2.89801  2.37887  2.77519  2.98518  4.58477  3.61503
          0.00981  6.32551  4.83198  0.61958  0.77255  0.48312  0.95934

          [... removed this part to keep things reasonably sized ...]

    100   1.71561  3.64865  4.38480  3.42760  4.07808  4.69231  4.80409  3.24425  2.74698  3.28918  4.57912  4.05114  5.07262  3.40708  2.94600  2.98074  1.91959  1.46391  4.53056  4.17751    268 v - - E
          2.68618  4.42225  2.77519  2.73123  3.46354  2.40513  3.72494  3.29354  2.67741  2.69355  4.24690  2.90347  2.73739  3.18146  2.89801  2.37887  2.77519  2.98518  4.58477  3.61503
          0.01668  6.34179  4.21465  0.61958  0.77255  0.56154  0.84473
    101   2.40957  4.62174  3.16892  1.34079  4.55620  3.69164  2.87511  4.51265  2.91940  2.76532  5.18494  3.33723  4.80964  3.03069  2.08808  2.60615  3.78804  3.55867  4.81816  3.79971    269 e - - E
          2.68618  4.42225  2.77519  2.73123  3.46354  2.40513  3.72494  3.29354  2.67741  2.69355  4.24690  2.90347  2.73739  3.18146  2.89801  2.37887  2.77519  2.98518  4.58477  3.61503
          0.05884  6.32777  2.89402  0.61958  0.77255  0.69190  0.69440
    102   2.66258  5.17554  4.55953  2.87507  3.15074  3.58098  2.17249  2.34555  3.60158  2.58045  3.32878  3.02642  5.07290  3.26130  3.35997  3.29569  3.03481  2.15305  2.79882  3.02577    270 v - - E
          2.68618  4.42225  2.77519  2.73123  3.46354  2.40513  3.72494  3.29354  2.67741  2.69355  4.24690  2.90347  2.73739  3.18146  2.89801  2.37887  2.77519  2.98518  4.58477  3.61503
          0.00189  6.27083        *  0.61958  0.77255  0.00000        *
//

The important part for IC is the section with lines that start from 1 and go to 102. Each of these numbers are the match states of the HMM profile–in this case the profile has 102 match states. On each line starting with one of these numbers are the negative natural log of the emission probabilities for each of the 20 amino acids. Therefore, by parsing this file by taking the negative exponentiation of these values, one can establish the emission probabilities for every match state in every HMM profile.

To grab the IC values for each match state, I wrote a parser class, anvio.pfam.HMMProfile, that parses .hmm profiles. I tried to make something generic, not just for the single-minded purpose of calculating IC, so it is basic but also extensible for future purposes. From what I can tell, most of the information in basic .hmm formats is being captured. Alongside this, IC is calculated for each match state in a dataframe that can be accessed like so:

print(anvio.pfam.HMMProfile(<filepath>).data['PF01965']['MATCH_STATES'])

This prints the following:

MATCH_STATE CS MM RF CONS MAP IC
0 E - - K 1 0.469514
1 E - - K 2 0.992492
2 E - - V 3 1.79484
3 E - - L 4 0.83366
4 E - - V 5 1.30396
5 E - - L 6 0.758542
6 E - - L 7 0.779465
148 E - - S 214 1.88744
149 S - - R 215 0.457319
150 S - - G 216 0.439365
151 G - - P 217 0.608714
152 G - - G 218 0.433156
153 G - - A 219 0.529335
154 H - - A 220 0.347968
155 H - - I 221 0.28936

The match states are 0-indexed in this dataframe.

And so with this class, anvi’o has easy access to IC for each match state in an .hmm file. One need only initiate the above class as demonstrated.

The distribution of information content

Rather than testing for information content (IC) > 4, I wanted to keep the threshold a tunable parameter for the user (with a default being 4). To help guide anyone’s investigation, I have compiled all of the IC values for each match state in the representable InteracDome dataset (2375 HMM profiles). This totals 414619 IC values. Here is a histogram of them:

IC-hist

Wow, an absolutely gorgeous long-tailed distribution exhibiting Pareto-like qualities. Here it is in log-log:

IC-hist-log

I digress–it’s shape does not so much matter. But what I do find interesting is that only a very small portion of sites have IC > 4, which is the cutoff threshold Kobren and Singh applied. Only 0.062% of match states have IC > 4, and on average, each HMM profile contains only 0.11 such match states. To see the effect of applying this filter to real hits, I did a series of tests with decreasing cutoff thresholds of IC:

total hits IC cutoff filtered
2898 4 0
2898 3.5 30
2898 3.0 151
2898 2.5 561
2898 2.0 1380
2898 1.5 2071

It is impossible to really say whether decreasing the IC cutoff below 4 is a good or bad idea without taking a close examination of hits that are filtered. I did not do this. What I do agree with is that IC > 4 match states are extremely rare, and thus represent the pinnacle of what should be conserved an ultra-conserved residue. If the sequence mismatches to the consensus value at these positions, we should definitely be justified in throwing the hit out. It is at this moment unclear if we could justifiably lower this threshold, or whether that would be throwing away useful data.

Cysteine & glycine are stubborn

Fun fact, this table shows the consensus of match states with IC > 4, partitioned by amino acid.

AA Number
C 115
G 79
P 16
D 15
H 15
R 4
W 4
K 2
N 2
E 2
Q 1
A 1
T 1

Cyteine and glycine are astonishingly more likely to be ultra-conserved than any other residue.

(5) Matching binding frequencies to the genes’ residues

The flow

Once domain hits have been filtered, we are ready to attribute binding frequencies. The information flow of binding frequencies looks like this:

                    [InteracDome]               [HMMER]
(binding frequency) ------------> (match state) ------> (user gene residue)

Binding frequencies are stored in either the conserved or representable InteracDome table and are attributed to match states of the IPfams. This is a mapping of binding frequency to match state. By running HMMER, we extracted alignment information of match states to user genes. This creates a mapping between user gene residues and match states, and by extension a mapping between user gene residues and binding frequencies. Because of the legwork already described in (4), there is very little to describe.

The most noteworthy thing to talk about here is how I dealt with multiple overlapping hits.

Dealing with multiple overlapping hits

Binding frequencies are held in a dataframe like this:

gene_callers_id codon_order_in_gene pfam_id match_state ligand binding_freq
1 169 PF00534 22 ADP 0.687948
1 169 PF13692 8 ADP 0.595441
1 174 PF00534 27 ADP 0.735759
1 174 PF13692 14 ADP 0.595441
1 184 PF00534 37 ADP 0.0697656
1 184 PF13692 24 ADP 0.101399
1 186 PF00534 39 ADP 0.0697656
1 186 PF13692 26 ADP 0.101399
1 187 PF13692 27 ADP 0.201761
1 189 PF00534 47 ADP 0.0697656

This is a great format, because not only do you have binding frequencies associated with the residues of user gene sequences (codon_order_in_gene), you also can see the exact match states that contributed the binding frequency. But this has redundant entries. What I want is a single binding frequency (for a given ligand) associated to each codon_order_in_gene, yet this format allows redundancies. For example, further down the table there are these two entries:

gene_callers_id codon_order_in_gene pfam_id match_state ligand binding_freq
1 167 PF00534 20 ALL_ 0.0979682
1 167 PF13692 6 ALL_ 0.103877

Here, we have an amino acid position that has two contributions, both from match_state 20 of pfam_id PF00534, and match_state 6 of pfam_id PF13692. To condense this information, I created a second dataframe that collapses the redundant rows. For simplicity, I am quite simply averaging all binding frequencies (for a given ligand) that associate to a given amino acid position. The above table is stored in the bind_freq attribute of anvio.interacdome.InteracDomeSuper, and the second, collapsed table is stored in the avg_bind_freq attribute of the same class. In the redundant case above, the corresponding entry in avg_bind_freq looks like this:

gene_callers_id ligand codon_order_in_gene binding_freq
1 ALL_ 167 0.100922

match_state and pfam_id are dropped since these are no longer meaningful columns. However, their information is still held in bind_freq (and as you will see in the next section, bind_freq is stored as a tab-separated file so this potentially very useful information is not lost to the user).

Filtering low binding frequency scores

If users are not interested in positions with low binding frequency scores, they can apply a filter with --min-binding-frequency and all entries below this threshold will be removed. This is also a very convenient way to maintain reasonable table size for storage in the contigs database. Here is a histogram of non-zero binding frequencies from a test dataset:

binding-freq-hist

It’s clear from the above figure that even if the user is comfortable with discarding frequencies < 0.05, it has huge effect on table size.

(6) Storing the per-residue binding frequencies into the contigs database

The data stored in the contigs database is a variant of avg_bind_freq. As a reminder, here is what avg_bind_freq looks like this:

gene_callers_id ligand codon_order_in_gene binding_freq
1 ADP 169 0.641695
1 ADP 174 0.6656
1 ADP 184 0.0855824
1 ADP 186 0.0855824
1 ADP 187 0.201761
1 ADP 189 0.0855824
1 ADP 190 0.163235
1 ADP 192 0.095622
1 ADP 196 0.262405
1 ADP 199 0.161423

And here is the corresponding section of what is ultimately stored in the contigs database:

item_name data_key data_value data_type data_group
1:169 ADP 0.641695 float InteracDome
1:174 ADP 0.6656 float InteracDome
1:184 ADP 0.0855824 float InteracDome
1:186 ADP 0.0855824 float InteracDome
1:187 ADP 0.201761 float InteracDome
1:189 ADP 0.0855824 float InteracDome
1:190 ADP 0.163235 float InteracDome
1:192 ADP 0.095622 float InteracDome
1:196 ADP 0.262405 float InteracDome
1:199 ADP 0.161423 float InteracDome

It may seem to be in oddly specific format, and that’s because it is. You see, up until a few months ago, anvi’o contigs databases had no means of storing per-nucleotide or per-residue information. Crazy, I know. The only thing that resembled this concept was per-nucleotide coverage values, but this is actually stored in the auxiliary database (a tumor of the profile database) that’s created during anvi-profile. A few months ago I wanted to address this shortcoming for the selfish reason of knowing that one day I would implement InteracDome functionality into anvi’o. And I didn’t want anvi’o to spit out some half-chewed tab-separated file to the user–I wanted the results to be stored in the contigs database so they could be used in integrated and interactive manners. So I created two new tables in the contigs database called amino_acid_additional_data and nucleotide_additional_data, which hijack the already existing framework for additional data tables. The convenience factor was through the roof for doing this, as I had to write very little code (here is the pull request) and now we can store arbitrary per-nucleotide and per-amino-acid annotations (not just InteracDome stuff, but practically anything besides .mp4 format). However, two downsides are very appreciable.

First, the framework requires a unique key for each amino acid. Yet, it is technically two keys that define an amino acid: the gene_callers_id of the gene it belongs to, and the codon_order_in_gene (the residue position) of the amino acid relative to its gene. The same goes for defining a nucleotide: you need the position in the contig (pos_in_contig) and you need the contig_name. So how do you get one key from two pieces of information? The solution, which is ugly, is to create a single key by concatenating these two pieces of information into a string. Doing so has distinct disadvantages, the main one being that SQL queries are hindered: it becomes impossible to straight-forwardly grab all amino acids belonging to a gene without first loading all entries into memory, for example. Regardless, I did it. Someone will hate me in 5 years.

Take another look at the above table and you will see the string concatenation I’m talking about under the item_name column. Entries under item_name equal to <gene_callers_id>:<codon_order_in_gene>, for example the first entry 1:169 is the 169th residue (0-indexed) in gene 1.

The second downside to doing this is specific to anvi-run-interacdome. You see, there is really a lot of information missing from the above table. For example, it would be great to know which IPfam ID(s) contributed a given binding freqency and with which of their match states they did so. I already discussed that this information is all present in the dataframe bind_freq. Unfortunately there just isn’t enough columns to accomodate this information, and I am fixed within this framework because I didn’t want to reinvent the wheel. To console victims of this tyranny, anvi-run-interacdome does more than just store the above table in the contigs database. It also outputs two tab-delimited files. Their names are by default INTERACDOME-match_state_contributors.txt and INTERACDOME-domain_hits.txt

INTERACDOME-match_state_contributors.txt is a table of the exact contents in bind_freq mentioned above. Since you’re somehow still reading this, I’ll show you the table again:

gene_callers_id codon_order_in_gene pfam_id match_state ligand binding_freq
1 169 PF00534 22 ADP 0.687948
1 169 PF13692 8 ADP 0.595441
1 174 PF00534 27 ADP 0.735759
1 174 PF13692 14 ADP 0.595441
1 184 PF00534 37 ADP 0.0697656
1 184 PF13692 24 ADP 0.101399
1 186 PF00534 39 ADP 0.0697656
1 186 PF13692 26 ADP 0.101399
1 187 PF13692 27 ADP 0.201761
1 189 PF00534 47 ADP 0.0697656

From this, one can trace back each and every match state that contributed to a given binding frequency. This is what giving power to the user means.

INTERACDOME-domain_hits.txt is a table of the exact contents in dom_hits mentioned above. Since you’re somehow still reading this, I’ll show you the table again:

pfam_name pfam_id corresponding_gene_call domain qual score bias c-evalue i-evalue hmm_start hmm_stop hmm_bounds ali_start ali_stop ali_bounds env_start env_stop env_bounds mean_post_prob match_state_align comparison_align sequence_align version
Beta_elim_lyase PF01212 1762 1 ! 20.9 0.1 1e-08 3.5e-06 33 169 .. 44 177 .. 34 215 .. 0.72 tvnrLedavaelfgke..aalfvpqGtaAnsill.kill.qr..geevivtepahihfdetgaiaelagvklrdlknkeaGkmdlekleaaikevgaheekiklisltvTnntagGqvvsleelrevaaiakkygiplhlDgA ++ +++ael+ + f+ Gt +++ l + + +r g+ +i++ h +et + g +l ++ +++G +++e+l+++i++ e i + +++v n+ G++ +++e+ ev +a+ +i++h+D+ LLQQARKQIAELINVSanEIYFTSGGTEGDNWVLkGTAIeKRefGNHIIISAVEHPAVTETAEQLVELGFELSYAPVDKEGRVKVEELQKLIRK—–ETILVSVMAVNNE–VGTIQPIKEISEV–LAEFPKIHFHVDAV 20
PAPS_reduct PF01507 1541 1 ! 36.1 0.1 3.6e-13 1.3e-10 2 164 .. 21 231 .. 20 234 .. 0.79 lvvsvsgGkdslVllhLalkafkpv….pvvfvdtghefpetiefvdeleeryglrlkvyepeeevaekinaekhgs.slyee.aaeriaKveplkk……………………………aLekldedall..tGaRrdesksraklpiveidedfek………slrvfPllnWteedvwqyilrenipynpLydqgfr + +s+sgGkds +++La + ++ ++ ++ + ++ t++f++++e+ +++ +++ ++++ + + +++ + + + e+ + p k e++ ++a+ +G+R++es +r++ +++ +++ + ++Pl++W+ d+w+ + +++yn +y++ ++ VYFSFSGGKDSGLMVQLANLVAEKLdrnfDLLILNIEANYTATVDFIKKIEQLPRVKNIYHFCLPFFEDNNTSFFQPQwKMWDPsEKEKWIHSLP–KnaitleniddglkkyyslsngnpdrflryfqnwYKEQYPQSAIScgVGIRAQESLHRHSAVTKGENKYKNRcwinitlegNILFYPLFDWKVGDIWAATFKCELEYNYIYEKMYK 18
Ank_2 PF12796 1756 1 ! 32.2 0 6.7e-12 2.3e-09 29 84 .] 74 135 .. 53 135 .. 0.85 aLhyAakngnleivklLle…h.a..adndgrtpLhyAarsghleivklLlekgadinlkd aL Aa + +++ vk +l+ + + +d +g+tpL +A+ ++ +ei+k L+++gadinl++ ALLEAANQRDTKKVKEILQdttYqVdeVDTEGNTPLNIAVHNNDIEIAKALIDRGADINLQN 6
Ank_2 PF12796 1756 2 ! 28.5 0 9.5e-11 3.3e-08 22 75 .. 199 265 .. 195 267 .] 0.76 pn..k.ngktaLhyAak..ngnl…eivklLleha…..adndgrtpLhyAarsghleivklLle ++ + +g taL+ A+ +gn +ivklL+e++ dn+grt++ yA ++g++ei k+L + IDfqNdFGYTALIEAVGlrEGNQlyqDIVKLLMENGadqsiKDNSGRTAMDYANQKGYTEISKILAQ 6
IGPS PF00218 1615 1 ! 20.6 0.1 1.2e-08 4e-06 202 249 .. 195 242 .. 73 248 .. 0.88 LaklvpkdvllvaeSGiktredveklkeegvnafLvGeslmrqedvek +++lv+++++++ae i+t+e+++++k+ gv ++ vG +++r ++ +k IKQLVQENICVIAEGKIHTPEQARQIKKLGVAGIVVGGAITRPQEIAK 20
Ribosomal_L33 PF00471 1562 1 ! 66.6 1.5 1.1e-22 3.7e-20 2 47 .] 4 49 .] 3 49 .] 0.97 kvtLeCteCksrnYtttknkrntperLelkKYcprcrkhtlhkEtK +++LeC e+++r Y t+knkrn+perLelkKY p++r++ ++kE K NIILECVETGERLYLTSKNKRNNPERLELKKYSPKLRRRAIFKEVK 19
Ribosomal_S14 PF00253 1565 1 ! 83.3 0.1 3.9e-28 1.3e-25 2 54 .] 36 88 .. 35 88 .. 0.98 laklprnssptrirnrCrvtGrprGvirkfgLsRicfRelAlkgelpGvkKaS laklpr+s+p+r+r r++ +GrprG++rkfg+sRi+fRel ++g +pGvkKaS LAKLPRDSNPNRLRLRDQTDGRPRGYMRKFGMSRIKFRELDHQGLIPGVKKAS 20
Polysacc_synt_C PF14667 1593 1 ! 61.4 19.2 5.4e-21 1.9e-18 2 139 .. 371 516 .. 370 519 .. 0.83 LailalsiiflslstvlssiLqglgrqkialkalvigalvklilnllliplfgivGaaiatvlallvvavlnlyalrrllgikl…llrrllkpllaalvmgivvylllllllglllla…al..alllavlvgalvYllllll L+ ++s+ +l+++t++ siLq+l +k+a+ ++ i++l+kli+++++i+lf +G +iat+++ ++++++ +++l+r++ i++ ++ +++ +++vm i+ +l+l+++ ++ + +l + l +++g++v+ + l++ LSATIISTSLLGIFTIVLSILQALSFHKKAMQITSITLLLKLIIQIPCIYLFKGYGLSIATIICTMFTTIIAYRFLSRKFDINPikyNRKYYSRLVYSTIVMTILSLLMLKIISSVYKFEstlQLffLISLIGCLGGVVFSVTLFR 5

From this, one can learn about how good a given hit was, the alignment of the user gene to the HMM match states, and hopefully anything else the user would be interested in.

Conclusion

Wow! That was certainly a mouthful. I know in a year or two when I am writing my thesis, this will be helpful to me, and I hope that it has been helpful to you too. If you have questions, let me know. And if you want to see all of this in action, be sure to check out the InteracDome section of the infant gut tutorial.