Chapter V - Reproducing Kiefl et al, 2022

Evan Kiefl

Quick Navigation
Preface
Results and Discussion

Intro
Polymorphism rates reveal intense purification of nonsynonymous mutants
Nonsynonymous polymorphism avoids buried sites
Nonsynonymous polymorphism avoids active sites
Proteomic trends in purifying selection are explained by RSA and DTL
Measuring purifying selection between genes and environments with pN/pS(gene)
Nitrogen availability governs rates of non-ideal polymorphism at critical sites of glutamine synthetase
Synonymous but not silent: selection against rare codons at critical sites

Methods

Predicting and processing protein structures
Predicting ligand-binding sites
Calculating distance-to-ligand (DTL)
Proportion of polymorphism rate variance explained by RSA and DTL
Supplementary Figures

Supplementary Information

Regimes of sequence similarity probed by metagenomics, SAR11 cultured genomes, and protein families
Comparing structure predictions between AlphaFold and MODELLER
RSA and DTL predict nonsynonymous polymorphism rates
dN/dS(gene) and sample-averaged pN/pS(gene) yield consistent results
Transcript abundance largely explains genic differences in the strengths of purifying selection
Stability analysis of polymorphism distributions with respect to pN/pS(core)

Chapter I: The prologue
Chapter II: Configure your system
Chapter III: Build the data
Chapter IV: Analyze the data
Chapter V: Reproduce every number ← you are here

Preface

If you are trying to figure out where a particular number came from, you’re in the right placei, because this Chapter is dedicated exclusively to reproducing all of the numbers found in the text of the manuscript. All numbers are categorized by the various sections they appear in, and are presented in the order they appear in the section.

Each number can be reproduced by running the associated code block.

Each code block begins with a comment that specifies the programming language used, e.g. # python. The three languages used are python, bash, and R.

If the language is either python or bash, it is assumed that your working directory is the root directory of the project, i.e. /some/path/that/ends/in/kiefl_2021.

If the language is R, it is assumed that the commands are issued from a fully initialized GRE. Once you’ve built your GRE, you can ensure your GRE is fully initialized by issuing the following commands:

request_scvs <- TRUE
request_regs <- TRUE
withRestarts(source("load_data.R"), terminate=function() message('load_data.R: data already loaded. Nice.'))

Finally, it is assumed you have completed all Steps from Chapter III (or equivalently downloaded the last checkpoint datapack). Some code blocks may require you to complete specific Analyses from Chapter IV. Such steps will specify, with a comment, which Analysis is a prerequisite.

Finally finally, some numbers that are repeatedly multiple times throughout the text. A code block is only provided for the first instance in which it appears.

Results and Discussion

Intro

(...) resulting in 390 million reads that were 94.5% identical to HIMB83 on average (Figure S1) (...)

# R
# Prerequisite: Analysis 2
args <- list()
args$reads <- "../18_PERCENT_ID.txt"
args$pfam <- "../18_PERCENT_ID_PFAM.txt"
args$pangenome <- "../18_PERCENT_ID_PANGENOME.txt"
reads <- read_tsv(args$reads) %>%
pivot_longer(cols=-gene_callers_id) %>%
group_by(gene_callers_id) %>%
summarise(reads = mean(value))
pfam <- read_tsv(args$pfam) %>%
rename(pfam = percent_id)
pan <- read_tsv(args$pangenome) %>%
rename(pan = percent_id)
df_wide <- reads %>%
left_join(pfam) %>%
left_join(pan)
df <- df_wide %>%
pivot_longer(cols=-gene_callers_id)
options(pillar.sigfig = 5)
print(df %>% group_by(name) %>% summarise(mean=mean(value, na.rm=TRUE), median=median(value, na.rm=TRUE)))

(...) Of the 1,470 genes in HIMB83 (...)

# bash
anvi-export-gene-calls -c CONTIGS.db -o temp --gene-caller prodigal
cat temp | tail +2 | wc -l

(...) we restricted our analysis to 799 genes that we determined to form the 1a.3.V core genes (...)

# bash
wc -l goi

(...) and 74 metagenomes in which the average coverage of HIMB83 exceeded 50X (...)

# bash
wc -l soi

Polymorphism rates reveal intense purification of nonsynonymous mutants

(...) Within the 1a.3.V core genes, we found a total of 9,537,022 SCVs, or 128,879 per metagenome on average (...)

# python
import pandas as pd
scv = pd.read_csv("11_SCVs.txt", sep='\t')
scv = scv[scv['departure_from_consensus'] > 0] # remove quince and kiefl entries
scv.shape[0]

(...) Within the 1a.3.V core genes, we found a total of 9,537,022 SCVs, or 128,879 per metagenome on average (...)

# python
import pandas as pd
scv = pd.read_csv("11_SCVs.txt", sep='\t')
scv = scv[scv['departure_from_consensus'] > 0] # remove quince and kiefl entries
scv.groupby("sample_id")['unique_pos_identifier'].count().mean()

(...) SCVs distributed throughout the genome such that 78% of codons (32% of nucleotides) exhibited minor allele frequencies >10% in at least one metagenome (...)

# R
tmp <- scvs %>% group_by(unique_pos_identifier) %>% summarize(max_var = max(departure_from_consensus, na.rm=T))
tmp %>% filter(max_var > 0.1) %>% dim() %>% .[[1]] / tmp %>% dim() %>% .[[1]]

(...) SCVs distributed throughout the genome such that 78% of codons (32% of nucleotides) exhibited minor allele frequencies >10% in at least one metagenome (...)

first do this

# bash
anvi-gen-variability-profile -c CONTIGS.db -p PROFILE.db --samples-of-interest soi --genes-of-interest goi --engine NT -o 11_SNVs.txt

then do this

# R
goi <- read_tsv("../goi", col_names=F) %>% pull(X1)
snvs <- read_tsv("../11_SNVs.txt")
tmp <- snvs %>% filter(corresponding_gene_call %in% goi) %>% group_by(unique_pos_identifier) %>% summarize(max_df = max(departure_from_consensus))
total_nts <- scvs %>% pull(unique_pos_identifier) %>% unique() %>% length() * 3
nts <- tmp %>% filter(max_df > 0.1) %>% dim() %>% .[[1]]
nts/total_nts

(...) our read recruitment strategy is stringent and yields reads that on average differ from HIMB83 in only 6 nucleotides out of 100 (Table S2) (...)

# R
args <- list()
args$reads <- "../18_PERCENT_ID.txt"
args$pfam <- "../18_PERCENT_ID_PFAM.txt"
args$pangenome <- "../18_PERCENT_ID_PANGENOME.txt"
reads <- read_tsv(args$reads) %>%
pivot_longer(cols=-gene_callers_id) %>%
group_by(gene_callers_id) %>%
summarise(reads = mean(value))
pfam <- read_tsv(args$pfam) %>%
rename(pfam = percent_id)
pan <- read_tsv(args$pangenome) %>%
rename(pan = percent_id)
df_wide <- reads %>%
left_join(pfam) %>%
left_join(pan)
df <- df_wide %>%
pivot_longer(cols=-gene_callers_id)
options(pillar.sigfig = 5)
tmp <- df %>% group_by(name) %>% summarise(mean=mean(value, na.rm=TRUE), median=median(value, na.rm=TRUE))
round(100-tmp[3,2], 0)

(...) Overall, we found that the average pS(site) outweighed pN(site) by 19:1 (Table S3) (...)

# python
import pandas as pd
df = pd.read_csv('11_SCVs.txt', sep='\t')
print(df.pS_popular_consensus.sum()/df.pN_popular_consensus.sum())

Nonsynonymous polymorphism avoids buried sites

(...) pN(site) values varied significantly from site-to-site and from sample-to-sample, but overall, site-to-site variance was more explanatory than sample-to-sample variance (79.74% ± 0.11% versus 0.42% ± 0.01% of total variance, ANOVA) (Figure S3) (...)

# R
scvs <- read_tsv("../11_SCVs.txt") %>%
    select(
        sample_id,
        unique_pos_identifier,
        pN_popular_consensus,
        pS_popular_consensus
    ) %>%
    filter(
        pS_popular_consensus > 0
    ) %>%
    mutate(
        unique_pos_identifier = as.factor(unique_pos_identifier)
    )

N <- 1000
exps <- 500
pN_site_vars <- c()
pN_sample_vars <- c()
pS_site_vars <- c()
pS_sample_vars <- c()

sites <- scvs %>% pull(unique_pos_identifier) %>% unique()

# This will take hours
for (i in 1:exps) {
    print(i)

    subset_sites <- sites %>% sample(N)
    scvs_subset <- scvs %>% filter(unique_pos_identifier %in% subset_sites)

    pN_anova <- scvs_subset %>%
        lm(pN_popular_consensus ~ unique_pos_identifier + sample_id, data = .) %>%
        anova()

    pS_anova <- scvs_subset %>%
        lm(pS_popular_consensus ~ unique_pos_identifier + sample_id, data = .) %>%
        anova()

    pN_site_explained <- pN_anova$`Sum Sq`[1] / sum(pN_anova$`Sum Sq`) * 100
    pN_sample_explained <- pN_anova$`Sum Sq`[2] / sum(pN_anova$`Sum Sq`) * 100
    pS_site_explained <- pS_anova$`Sum Sq`[1] / sum(pS_anova$`Sum Sq`) * 100
    pS_sample_explained <- pS_anova$`Sum Sq`[2] / sum(pS_anova$`Sum Sq`) * 100

    pN_site_vars <- c(pN_site_vars, pN_site_explained)
    pN_sample_vars <- c(pN_sample_vars, pN_sample_explained)
    pS_site_vars <- c(pS_site_vars, pS_site_explained)
    pS_sample_vars <- c(pS_sample_vars, pS_sample_explained)
}

pN_site_mean <- pN_site_vars %>% mean()
pN_site_stderr <- pN_site_vars %>% sd() / sqrt(exps)
pS_site_mean <- pS_site_vars %>% mean()
pS_site_stderr <- pS_site_vars %>% sd() / sqrt(exps)
pN_sample_mean <- pN_sample_vars %>% mean()
pN_sample_stderr <- pN_sample_vars %>% sd() / sqrt(exps)
pS_sample_mean <- pS_sample_vars %>% mean()
pS_sample_stderr <- pS_sample_vars %>% sd() / sqrt(exps)

print(paste("% pN(site) explained by site:", pN_site_mean, "+-", pN_site_stderr))
print(paste("% pN(site) explained by sample:", pN_sample_mean, "+-", pN_sample_stderr))
print(paste("% pS(site) explained by site:", pS_site_mean, "+-", pS_site_stderr))
print(paste("% pS(site) explained by sample:", pS_sample_mean, "+-", pS_sample_stderr))

(...) We used two independent methods to predict protein structures for the 799 core genes of 1a.3.V: (1) a template-based homology modeling approach with MODELLER (Webb and Sali 2016), which predicted 346 structures, and (2) a transformer-like deep learning approach with AlphaFold (Jumper et al. 2021), which predicted 754 (...)

# bash
wc -l 12_GENES_WITH_GOOD_STRUCTURES_MODELLER

(...) We used two independent methods to predict protein structures for the 799 core genes of 1a.3.V: (1) a template-based homology modeling approach with MODELLER (Webb and Sali 2016), which predicted 346 structures, and (2) a transformer-like deep learning approach with AlphaFold (Jumper et al. 2021), which predicted 754 (...)

# bash
wc -l 12_GENES_WITH_GOOD_STRUCTURES

(...) Our evaluation of the 339 genes for which both methods predicted structures (Supplementary Information) revealed a comparable accuracy between AlphaFold and MODELLER (Figure S4, Table S4) (...)

# bash
goi_af = set([int(x.strip()) for x in open('12_GENES_WITH_GOOD_STRUCTURES').readlines()])
goi_mod = set([int(x.strip()) for x in open('12_GENES_WITH_GOOD_STRUCTURES_MODELLER').readlines()])
len(goi_af.intersection(goi_mod))

(...) These data showed that pS(site) closely resembled the null distribution (two-sample Kolmogorov-Smirnov statistic = 0.016), which illustrates the lack of influence of RSA on s-polymorphism, while pN(site) deviated significantly and instead exhibited strong preference for sites with higher RSA (two-sample Kolmogorov-Smirnov statistic = 0.235). (...)

# R
psRSA <- perform_ks_test("pS_popular_consensus", "rel_solvent_acc", trials=10)
psRSA[["KS_Stat"]]

Show/Hide perform_ks_test

perform_ks_test is already in the GRE. But for your reference, here it is alongside its helper function and a stackoverflow reference:

# https://stackoverflow.com/questions/40044375/how-to-calculate-the-kolmogorov-smirnov-statistic-between-two-weighted-samples/55664242#55664242
ks_weighted <- function(vector_1,vector_2,weights_1,weights_2){
    F_vec_1 <- ewcdf(vector_1, weights = weights_1, normalise=FALSE)
    F_vec_2 <- ewcdf(vector_2, weights = weights_2, normalise=FALSE)
    xw <- c(vector_1,vector_2) 
    d <- max(abs(F_vec_1(xw) - F_vec_2(xw)))

    # P-VALUE EFFECTIVE SAMPLE SIZE as suggested by Monahan
    n_vector_1 <- sum(weights_1)^2/sum(weights_1^2)
    n_vector_2 <- sum(weights_2)^2/sum(weights_2^2)
    n <- n_vector_1 * n_vector_2/(n_vector_1 + n_vector_2)

    pkstwo <- function(x, tol = 1e-100) {
                if (is.numeric(x)) 
                    x <- as.double(x)
                else stop("argument 'x' must be numeric")
                p <- rep(0, length(x))
                p[is.na(x)] <- NA
                IND <- which(!is.na(x) & (x > 0))
                if (length(IND)) 
                    p[IND] <- .Call(stats:::C_pKS2, p = x[IND], tol)
                p
            }

    pval <- 1 - pkstwo(sqrt(n) * d)

    if (pval == 0) {
        # R internals (stats:::C_pKS2) has returned 0 for the probability. But an order of the
        # precision is useful, which I approximate by multiplying the one-sided hypothesis pval
        # by 2.
        one_sided <- exp(-2*n*d^2)
        pval <- 2 * one_sided
    }

    if (pval == 0) {
        # If pval is _still_ 0, then the value is <1e-300. I know this because I have observed the
        # value 2.021712e-304 through experimentation.
        pval <- 1e-300
    }

    out <- c(KS_Stat=d, P_value=pval)
    return(out)
}

perform_ks_test <- function(weights, variable, trials=10) {
    library(spatstat)
    KS <- 1e100
    p <- 0
    d <- scvs %>%
        select((!!sym(variable)), (!!sym(weights))) %>%
        filter(!is.na((!!sym(weights))), !is.na((!!sym(variable))))
    for (i in 1:trials) {
        shuffle <- d %>% sample_n(size=d %>% dim() %>% .[[1]], replace=F) %>% .[2] %>% .[[1]]
        d$shuffled_weights <- shuffle/sum(shuffle)
        d$weights <- (d[,weights] / sum(d[,weights])) %>% .[[1]]
        out <- ks_weighted(vector_1=d[1][[1]], vector_2=d[1][[1]], weights_1=d$weights, weights_2=d$shuffled_weights)
        trial_KS <- out[1][[1]]
        trial_p <- out[2][[1]]
        if (trial_KS < KS) {KS <- trial_KS}
        if (trial_p > p) {p <- trial_p}
    }
    return(c(KS_Stat=KS, P_value=p))
}

(...) These data showed that pS(site) closely resembled the null distribution (two-sample Kolmogorov-Smirnov statistic = 0.016), which illustrates the lack of influence of RSA on s-polymorphism, while pN(site) deviated significantly and instead exhibited strong preference for sites with higher RSA (two-sample Kolmogorov-Smirnov statistic = 0.235). (...)

# R
pnRSA <- perform_ks_test("pN_popular_consensus", "rel_solvent_acc", trials=10)
pnRSA[["KS_Stat"]]

Show/Hide perform_ks_test

perform_ks_test is already in the GRE. But for your reference, here it is alongside its helper function and a stackoverflow reference:

# https://stackoverflow.com/questions/40044375/how-to-calculate-the-kolmogorov-smirnov-statistic-between-two-weighted-samples/55664242#55664242
ks_weighted <- function(vector_1,vector_2,weights_1,weights_2){
    F_vec_1 <- ewcdf(vector_1, weights = weights_1, normalise=FALSE)
    F_vec_2 <- ewcdf(vector_2, weights = weights_2, normalise=FALSE)
    xw <- c(vector_1,vector_2) 
    d <- max(abs(F_vec_1(xw) - F_vec_2(xw)))

    # P-VALUE EFFECTIVE SAMPLE SIZE as suggested by Monahan
    n_vector_1 <- sum(weights_1)^2/sum(weights_1^2)
    n_vector_2 <- sum(weights_2)^2/sum(weights_2^2)
    n <- n_vector_1 * n_vector_2/(n_vector_1 + n_vector_2)

    pkstwo <- function(x, tol = 1e-100) {
                if (is.numeric(x)) 
                    x <- as.double(x)
                else stop("argument 'x' must be numeric")
                p <- rep(0, length(x))
                p[is.na(x)] <- NA
                IND <- which(!is.na(x) & (x > 0))
                if (length(IND)) 
                    p[IND] <- .Call(stats:::C_pKS2, p = x[IND], tol)
                p
            }

    pval <- 1 - pkstwo(sqrt(n) * d)

    if (pval == 0) {
        # R internals (stats:::C_pKS2) has returned 0 for the probability. But an order of the
        # precision is useful, which I approximate by multiplying the one-sided hypothesis pval
        # by 2.
        one_sided <- exp(-2*n*d^2)
        pval <- 2 * one_sided
    }

    if (pval == 0) {
        # If pval is _still_ 0, then the value is <1e-300. I know this because I have observed the
        # value 2.021712e-304 through experimentation.
        pval <- 1e-300
    }

    out <- c(KS_Stat=d, P_value=pval)
    return(out)
}

perform_ks_test <- function(weights, variable, trials=10) {
    library(spatstat)
    KS <- 1e100
    p <- 0
    d <- scvs %>%
        select((!!sym(variable)), (!!sym(weights))) %>%
        filter(!is.na((!!sym(weights))), !is.na((!!sym(variable))))
    for (i in 1:trials) {
        shuffle <- d %>% sample_n(size=d %>% dim() %>% .[[1]], replace=F) %>% .[2] %>% .[[1]]
        d$shuffled_weights <- shuffle/sum(shuffle)
        d$weights <- (d[,weights] / sum(d[,weights])) %>% .[[1]]
        out <- ks_weighted(vector_1=d[1][[1]], vector_2=d[1][[1]], weights_1=d$weights, weights_2=d$shuffled_weights)
        trial_KS <- out[1][[1]]
        trial_p <- out[2][[1]]
        if (trial_KS < KS) {KS <- trial_KS}
        if (trial_p > p) {p <- trial_p}
    }
    return(c(KS_Stat=KS, P_value=p))
}

Nonsynonymous polymorphism avoids active sites

(...) [Figure 2:] The pN(site) distribution (red line) and pS(site) distribution (blue line) were created by weighting the RSA values of 239,528 sites (coming from the 754 genes with predicted structures) by the pN(site) and pS(site) values observed in each of the 74 samples, totaling 17,725,072 pN(site) and pS(site) values (...)

# R
scvs %>% filter(!is.na(rel_solvent_acc)) %>% .$unique_pos_identifier %>% unique() %>% length()

(...) [Figure 2:] The pN(site) distribution (red line) and pS(site) distribution (blue line) were created by weighting the RSA values of 239,528 sites (coming from the 754 genes with predicted structures) by the pN(site) and pS(site) values observed in each of the 74 samples, totaling 17,725,072 pN(site) and pS(site) values (...)

# R
scvs %>% filter(!is.na(rel_solvent_acc)) %>% dim() %>% .[[1]]

(...) [Figure 2:] The pN(site) distribution (red line) and pS(site) distribution (blue line) were created by weighting the DTL values of 155,478 sites (coming from 415 genes that had predicted structures and at least one predicted ligand) by the pN(site) and pS(site) values observed in each of the 74 samples, totaling 11,505,372 pN(site) and pS(site) values (...)

# R
scvs %>% filter(!is.na(ANY_dist)) %>% .$unique_pos_identifier %>% unique() %>% length()

(...) [Figure 2:] The pN(site) distribution (red line) and pS(site) distribution (blue line) were created by weighting the DTL values of 155,478 sites (coming from 415 genes that had predicted structures and at least one predicted ligand) by the pN(site) and pS(site) values observed in each of the 74 samples, totaling 11,505,372 pN(site) and pS(site) values (...)

# R
scvs %>% filter(!is.na(ANY_dist)) %>% .$gene_callers_id %>% unique() %>% length()

(...) [Figure 2:] The pN(site) distribution (red line) and pS(site) distribution (blue line) were created by weighting the DTL values of 155,478 sites (coming from 415 genes that had predicted structures and at least one predicted ligand) by the pN(site) and pS(site) values observed in each of the 74 samples, totaling 11,505,372 pN(site) and pS(site) values (...)

# R
scvs %>% filter(!is.na(ANY_dist)) %>% dim() %>% .[[1]]

(...) A model has been fit to each gene-sample pair that passed filtering criteria (see Supplementary Information), resulting in 16,285 nonsynonymous models and 24,553 synonymous models. (...)

# R
pn_models %>% dim() %>% .[[1]]

(...) A model has been fit to each gene-sample pair that passed filtering criteria (see Supplementary Information), resulting in 16,285 nonsynonymous models and 24,553 synonymous models. (...)

# R
ps_models %>% dim() %>% .[[1]]

(...) The average per-site ns-polymorphism rate throughout the 1a.3.V core genome was 0.0088, however, we observed a nearly 4-fold reduction in this rate to just 0.0024 at predicted ligand binding sites (DTL = 0) (...)

# R
scvs %>% pull(pN_popular_consensus) %>% mean(na.rm=T)

(...) The average per-site ns-polymorphism rate throughout the 1a.3.V core genome was 0.0088, however, we observed a nearly 4-fold reduction in this rate to just 0.0024 at predicted ligand binding sites (DTL = 0) (...)

# R
scvs %>% filter(ANY_dist == 0) %>% pull(pN_popular_consensus) %>% mean(na.rm=T)

(...) indicating significantly (left-tailed Z test, p-value < <1x10-300) stronger purifying selection at ligand-binding sites (...)

# R
population_mean <- scvs %>% pull(pN_popular_consensus) %>% mean(na.rm=T)
population_sd <- scvs %>% pull(pN_popular_consensus) %>% sd(na.rm=T)
sample_mean <- scvs %>% filter(ANY_dist == 0) %>% pull(pN_popular_consensus) %>% mean(na.rm=T)
sample_size <- scvs %>% filter(ANY_dist == 0) %>% pull(pN_popular_consensus) %>% length()
Z <- (sample_mean - population_mean) / (population_sd / sqrt(sample_size))
p_val <- pnorm(Z, lower.tail=T)
print(p_val)

(...) Sites neighboring ligand-binding regions also harbored disproportionately low rates of ns-polymorphism, as indicated by the significant deviation towards larger DTL values (two-sample Kolmogorov-Smirnov statistic = 0.157). (...)

# R
pnDTL <- perform_ks_test("pN_popular_consensus", "ANY_dist", trials=10)
pnDTL[["KS_Stat"]]

Show/Hide perform_ks_test

perform_ks_test is already in the GRE. But for your reference, here it is alongside its helper function and a stackoverflow reference:

# https://stackoverflow.com/questions/40044375/how-to-calculate-the-kolmogorov-smirnov-statistic-between-two-weighted-samples/55664242#55664242
ks_weighted <- function(vector_1,vector_2,weights_1,weights_2){
    F_vec_1 <- ewcdf(vector_1, weights = weights_1, normalise=FALSE)
    F_vec_2 <- ewcdf(vector_2, weights = weights_2, normalise=FALSE)
    xw <- c(vector_1,vector_2) 
    d <- max(abs(F_vec_1(xw) - F_vec_2(xw)))

    # P-VALUE EFFECTIVE SAMPLE SIZE as suggested by Monahan
    n_vector_1 <- sum(weights_1)^2/sum(weights_1^2)
    n_vector_2 <- sum(weights_2)^2/sum(weights_2^2)
    n <- n_vector_1 * n_vector_2/(n_vector_1 + n_vector_2)

    pkstwo <- function(x, tol = 1e-100) {
                if (is.numeric(x)) 
                    x <- as.double(x)
                else stop("argument 'x' must be numeric")
                p <- rep(0, length(x))
                p[is.na(x)] <- NA
                IND <- which(!is.na(x) & (x > 0))
                if (length(IND)) 
                    p[IND] <- .Call(stats:::C_pKS2, p = x[IND], tol)
                p
            }

    pval <- 1 - pkstwo(sqrt(n) * d)

    if (pval == 0) {
        # R internals (stats:::C_pKS2) has returned 0 for the probability. But an order of the
        # precision is useful, which I approximate by multiplying the one-sided hypothesis pval
        # by 2.
        one_sided <- exp(-2*n*d^2)
        pval <- 2 * one_sided
    }

    if (pval == 0) {
        # If pval is _still_ 0, then the value is <1e-300. I know this because I have observed the
        # value 2.021712e-304 through experimentation.
        pval <- 1e-300
    }

    out <- c(KS_Stat=d, P_value=pval)
    return(out)
}

perform_ks_test <- function(weights, variable, trials=10) {
    library(spatstat)
    KS <- 1e100
    p <- 0
    d <- scvs %>%
        select((!!sym(variable)), (!!sym(weights))) %>%
        filter(!is.na((!!sym(weights))), !is.na((!!sym(variable))))
    for (i in 1:trials) {
        shuffle <- d %>% sample_n(size=d %>% dim() %>% .[[1]], replace=F) %>% .[2] %>% .[[1]]
        d$shuffled_weights <- shuffle/sum(shuffle)
        d$weights <- (d[,weights] / sum(d[,weights])) %>% .[[1]]
        out <- ks_weighted(vector_1=d[1][[1]], vector_2=d[1][[1]], weights_1=d$weights, weights_2=d$shuffled_weights)
        trial_KS <- out[1][[1]]
        trial_p <- out[2][[1]]
        if (trial_KS < KS) {KS <- trial_KS}
        if (trial_p > p) {p <- trial_p}
    }
    return(c(KS_Stat=KS, P_value=p))
}

(...) Comparatively, pS(site) deviated minimally from the null distribution (two-sample Kolmogorov-Smirnov statistic = 0.013). (...)

# R
psDTL <- perform_ks_test("pS_popular_consensus", "ANY_dist", trials=10)
psDTL[["KS_Stat"]]

Show/Hide perform_ks_test

perform_ks_test is already in the GRE. But for your reference, here it is alongside its helper function and a stackoverflow reference:

# https://stackoverflow.com/questions/40044375/how-to-calculate-the-kolmogorov-smirnov-statistic-between-two-weighted-samples/55664242#55664242
ks_weighted <- function(vector_1,vector_2,weights_1,weights_2){
    F_vec_1 <- ewcdf(vector_1, weights = weights_1, normalise=FALSE)
    F_vec_2 <- ewcdf(vector_2, weights = weights_2, normalise=FALSE)
    xw <- c(vector_1,vector_2) 
    d <- max(abs(F_vec_1(xw) - F_vec_2(xw)))

    # P-VALUE EFFECTIVE SAMPLE SIZE as suggested by Monahan
    n_vector_1 <- sum(weights_1)^2/sum(weights_1^2)
    n_vector_2 <- sum(weights_2)^2/sum(weights_2^2)
    n <- n_vector_1 * n_vector_2/(n_vector_1 + n_vector_2)

    pkstwo <- function(x, tol = 1e-100) {
                if (is.numeric(x)) 
                    x <- as.double(x)
                else stop("argument 'x' must be numeric")
                p <- rep(0, length(x))
                p[is.na(x)] <- NA
                IND <- which(!is.na(x) & (x > 0))
                if (length(IND)) 
                    p[IND] <- .Call(stats:::C_pKS2, p = x[IND], tol)
                p
            }

    pval <- 1 - pkstwo(sqrt(n) * d)

    if (pval == 0) {
        # R internals (stats:::C_pKS2) has returned 0 for the probability. But an order of the
        # precision is useful, which I approximate by multiplying the one-sided hypothesis pval
        # by 2.
        one_sided <- exp(-2*n*d^2)
        pval <- 2 * one_sided
    }

    if (pval == 0) {
        # If pval is _still_ 0, then the value is <1e-300. I know this because I have observed the
        # value 2.021712e-304 through experimentation.
        pval <- 1e-300
    }

    out <- c(KS_Stat=d, P_value=pval)
    return(out)
}

perform_ks_test <- function(weights, variable, trials=10) {
    library(spatstat)
    KS <- 1e100
    p <- 0
    d <- scvs %>%
        select((!!sym(variable)), (!!sym(weights))) %>%
        filter(!is.na((!!sym(weights))), !is.na((!!sym(variable))))
    for (i in 1:trials) {
        shuffle <- d %>% sample_n(size=d %>% dim() %>% .[[1]], replace=F) %>% .[2] %>% .[[1]]
        d$shuffled_weights <- shuffle/sum(shuffle)
        d$weights <- (d[,weights] / sum(d[,weights])) %>% .[[1]]
        out <- ks_weighted(vector_1=d[1][[1]], vector_2=d[1][[1]], weights_1=d$weights, weights_2=d$shuffled_weights)
        trial_KS <- out[1][[1]]
        trial_p <- out[2][[1]]
        if (trial_KS < KS) {KS <- trial_KS}
        if (trial_p > p) {p <- trial_p}
    }
    return(c(KS_Stat=KS, P_value=p))
}

Proteomic trends in purifying selection are explained by RSA and DTL

(...) By fitting a series of linear models to log-transformed polymorphism data (Table S6), we conclude that RSA and DTL can explain 11.83% and 6.89% of pN(site) variation, respectively (...)

# R
# Prerequisite: Analysis 11
df <- read_tsv('../WW_TABLES/MODELS.txt')
df$RSA[[2]]

(...) By fitting a series of linear models to log-transformed polymorphism data (Table S6), we conclude that RSA and DTL can explain 11.83% and 6.89% of pN(site) variation, respectively (...)

# R
# Prerequisite: Analysis 11
df <- read_tsv('../WW_TABLES/MODELS.txt')
df$DTL[[4]]

(...) Based on these models we estimate that for any given gene in any given sample, (1) a 1% increase in RSA corresponds to a 0.98% increase in pN(site), and (2) a 1% increase in DTL (normalized by the maximum DTL in the gene) corresponds to a 0.90% increase in pN(site) (...)

# R
# Prerequisite: Analysis 11
lm_rsa_gene_sample_pn <- readRDS("../lm_rsa_gene_sample_pn.RDS")
print(round(100*(exp((lm_rsa_gene_sample_pn %>% summary %>% coef)["rel_solvent_acc","Estimate"] * 0.01) - 1), 2))
rm(lm_rsa_gene_sample_pn)

(...) Based on these models we estimate that for any given gene in any given sample, (1) a 1% increase in RSA corresponds to a 0.98% increase in pN(site), and (2) a 1% increase in DTL (normalized by the maximum DTL in the gene) corresponds to a 0.90% increase in pN(site) (...)

# R
# Prerequisite: Analysis 11
lm_dtl_gene_sample_pn <- readRDS("../lm_dtl_gene_sample_pn.RDS")
print(round(100*(exp((lm_dtl_gene_sample_pn %>% summary %>% coef)["ANY_dist","Estimate"] * 0.01) - 1), 2))
history()
rm(lm_dtl_gene_sample_pn)

(...) In a combined model, RSA and DTL jointly explained 14.12% of pN(site) variation, and after adjusting for gene-to-gene and sample-to-sample variance, 17.07% of the remaining variation could be explained by RSA and DTL (...)

# R
# Prerequisite: Analysis 11
lm_rsa_dtl_gene_sample_pn <- readRDS("../lm_rsa_dtl_gene_sample_pn.RDS")
formatted_anova_pn_rsa_dtl <- (100*(lm_rsa_dtl_gene_sample_pn %>% anova)$"Sum Sq"/sum((lm_rsa_dtl_gene_sample_pn %>% anova)$"Sum Sq")) 
names(formatted_anova_pn_rsa_dtl) <- rownames(lm_rsa_dtl_gene_sample_pn %>% anova)
combined <- sum(formatted_anova_pn_rsa_dtl["ANY_dist"], formatted_anova_pn_rsa_dtl["rel_solvent_acc"])
print(combined %>% round(2))
rm(formatted_anova_pn_rsa_dtl)

(...) In a combined model, RSA and DTL jointly explained 14.12% of pN(site) variation, and after adjusting for gene-to-gene and sample-to-sample variance, 17.07% of the remaining variation could be explained by RSA and DTL (...)

# R
# Prerequisite: Analysis 11
lm_rsa_dtl_gene_sample_pn <- readRDS("../lm_rsa_dtl_gene_sample_pn.RDS")
formatted_anova_pn_rsa_dtl <- (100*(lm_rsa_dtl_gene_sample_pn %>% anova)$"Sum Sq"/sum((lm_rsa_dtl_gene_sample_pn %>% anova)$"Sum Sq")) 
names(formatted_anova_pn_rsa_dtl) <- rownames(lm_rsa_dtl_gene_sample_pn %>% anova)
combined <- sum(formatted_anova_pn_rsa_dtl["ANY_dist"], formatted_anova_pn_rsa_dtl["rel_solvent_acc"])
print(100*combined / (combined + formatted_anova_pn_rsa_dtl["Residuals"]) %>% round(2))
rm(formatted_anova_pn_rsa_dtl)

(...) In comparison, only 0.35% of pS(site) variation was explained by RSA and DTL (...)

# R
# Prerequisite: Analysis 11
lm_rsa_dtl_gene_sample_ps <- readRDS("../lm_rsa_dtl_gene_sample_ps.RDS")
formatted_anova_ps_rsa_dtl <- (100*(lm_rsa_dtl_gene_sample_ps %>% anova)$"Sum Sq"/sum((lm_rsa_dtl_gene_sample_ps %>% anova)$"Sum Sq")) 
names(formatted_anova_ps_rsa_dtl) <- rownames(lm_rsa_dtl_gene_sample_ps %>% anova)
combined <- sum(formatted_anova_ps_rsa_dtl["ANY_dist"], formatted_anova_ps_rsa_dtl["rel_solvent_acc"])
print(100*combined_ps / (combined_ps + formatted_anova_ps_rsa_dtl["Residuals"]) %>% round(2))
rm(formatted_anova_ps_rsa_dtl)

(...) Analyzing gene-sample pairs revealed that the extent of ns-polymorphism rate that can be explained by RSA and DTL is not uniform across all genes (Table S7) and can reach up to 52.6% and 51.4%, respectively (Figures S7, S8) (...)

# R
print(poly_corr %>% pull(pn_rsq_rsa) %>% max(na.rm=T))
print(poly_corr %>% pull(pn_rsq_dist) %>% max(na.rm=T))

(...) Linear regressions of these data show that 83.6% of per-group ns-polymorphism rates and 20.7% of per-group s-polymorphism rates are explained by RSA and DTL (Supplementary Information) (...)

# R
print(pn_group_model %>% summary())
print(ps_group_model %>% summary())

Measuring purifying selection between genes and environments with pN/pS(gene)

(...) Taking advantage of the large number of metagenomes in which 1a.3.V was present, we calculated pN/pS(gene) for all 799 protein-coding core genes across 74 samples (see Methods), resulting in 59,126 gene/sample pairs (Table S9) (...)

# R
pnps %>% dim() %>% .[[1]]

(...) We validated our calculations by comparing sample-averaged pN/pS(gene) to dN/dS(gene) calculated from homologous gene pairs between HIMB83 and HIMB122, another SAR11 isolate genome that is closely related to HIMB83 (gANI: 82.6%), which we found to yield commensurable results (Figure S10, Table S12, Supplementary Information) (...)

# python
# Prerequisite: Aux. Step 1
import pandas as pd
df = pd.read_csv("07_ANI_HIMB122/ANIb_percentage_identity.txt", sep='\t').set_index('key')
print(df['HIMB083'].sort_values())

(...) All but one gene (gene #2031, unknown function) maintained pN/pS(gene) << 1 in every sample, whereby 95% of values were less than 0.15 (Figure S12, Table S9) (...)

# R
pnps %>% filter(pnps > 1) %>% select(gene_callers_id, function_COG20_FUNCTION, function_KOfam, function_Pfam)

(...) All but one gene (gene #2031, unknown function) maintained pN/pS(gene) << 1 in every sample, whereby 95% of values were less than 0.15 (Figure S12, Table S9) (...)

# R
pnps %>% .$pnps %>% quantile(0.95, na.rm=T)

(...) In fact, gene-to-gene variance, as opposed to sample-to-sample variance, explained 93% of pN/pS(gene) variation (ANOVA, Figure S11) (...)

# R
my_anova <- pnps %>% mutate(gene_callers_id = as.factor(gene_callers_id)) %>% lm(pnps ~ gene_callers_id + sample_id, data = .) %>% anova()
gene_explained <- my_anova$`Sum Sq`[1] / sum(my_anova$`Sum Sq`) * 100
sample_explained <- my_anova$`Sum Sq`[2] / sum(my_anova$`Sum Sq`) * 100
print(my_anova)
print(gene_explained)
print(sample_explained)

(...) By analyzing the companion metatranscriptomic data (Salazar et al. 2019) that were available for 50 of the 74 metagenomes, we were able to explain 29% of gene-to-gene variance with gene transcript abundance (Table S13, Supplementary Information) (...)

# R
cor(TA_sample_averaged$log_pnps_mean, TA_sample_averaged$log_TA_median, use="pairwise.complete.obs") ^ 2 * 100

(...) The amount of pN/pS(gene) variation attributable to sample-to-sample variance was only 0.7% (Figure S11) (...)

# R
my_anova <- pnps %>% mutate(gene_callers_id = as.factor(gene_callers_id)) %>% lm(pnps ~ gene_callers_id + sample_id, data = .) %>% anova()
gene_explained <- my_anova$`Sum Sq`[1] / sum(my_anova$`Sum Sq`) * 100
sample_explained <- my_anova$`Sum Sq`[2] / sum(my_anova$`Sum Sq`) * 100
print(my_anova)
print(gene_explained)
print(sample_explained)

Nitrogen availability governs rates of non-ideal polymorphism at critical sites of glutamine synthetase

(...) [T]he sample-averaged pN/pS(GS) was 0.02, ranking GS amongst the top 11% most purified genes (Figure 3b, Table S9) (...)

# R
pnps %>% group_by(gene_callers_id) %>% summarize(pnps=mean(pnps, na.rm=T)) %>% filter(gene_callers_id == 2602) %>% pull(pnps)

(...) Indeed, the sample-averaged pN/pS(GS) was 0.02, ranking GS amongst the top 11% most purified genes (Figure 3b, Table S9) (...)

# R
pnps %>% group_by(gene_callers_id) %>% summarize(pnps=mean(pnps, na.rm=T)) %>% arrange(pnps) %>% mutate(rank = 1:nrow(.)) %>% filter(gene_callers_id==2602) %>% pull(rank) / (pnps %>% .$gene_callers_id %>% unique() %>% length()) * 100

(...) Although highly purified, we observed significant sample-to-sample variation in pN/pS(GS) (min = 0.010, max = 0.036) suggesting that the strength of purifying selection on GS varies from sample to sample (Figure 3b inset) (...)

# R
print(pnps %>% filter(gene_callers_id == 2602) %>% pull(pnps) %>% min())
print(pnps %>% filter(gene_callers_id == 2602) %>% pull(pnps) %>% max())

(...) [We] found a positive correlation between measured nitrate concentrations and pN/pS(GS) values across samples (Pearson correlation p-value = 0.009, R2 = 0.11) (Figure 3c), which ranked amongst the top 12% of positive correlations between pN/pS(gene) and nitrate concentration (Figure 3c inset, Table S10) (...)

# R
temp <- pnps %>% filter(gene_callers_id == 2602)
print(cor.test(temp$pnps, temp$nitrates, use='pairwise.complete.obs') %>% .$p.value)
print(cor.test(temp$pnps, temp$nitrates, use='pairwise.complete.obs') %>% .$estimate %>% .^2)

(...) [We] found a positive correlation between measured nitrate concentrations and pN/pS(GS) values across samples (Pearson correlation p-value = 0.009, R2 = 0.11) (Figure 3c), which ranked amongst the top 12% of positive correlations between pN/pS(gene) and nitrate concentration (Figure 3c inset, Table S10) (...)

# R
GS <- env_corr %>% filter(gene_callers_id==2602) %>% pull(cor_nitrates)
sum(env_corr %>% pull(cor_nitrates) > GS) / dim(env_corr)[[1]] * 100

(...) From this stitched quaternary structure we recalculated RSA and DTL, and as expected, this yielded lower average RSA and DTL estimates due to the presence of adjacent monomers (0.17 versus 0.24 for RSA and 17.8Å versus 21.2Å for DTL) (...)

# R
dists <- scvs %>%
filter(
gene_callers_id == 2602,
sample_id == 'ANE_004_05M' # a random sample will do
) %>%
select(
RSA_mon = rel_solvent_acc,
RSA_dec = complex_RSA,
DTL_mon = GLU_dist_2602,
DTL_dec = complex_DTL,
)
print(dists %>% pull(RSA_dec) %>% mean(na.rm=T) # complex)
print(dists %>% pull(RSA_mon) %>% mean(na.rm=T) # monomer)
print(dists %>% pull(DTL_dec) %>% mean(na.rm=T) # complex)
print(dists %>% pull(DTL_mon) %>% mean(na.rm=T) # monomer)

(...) With these quaternary estimates of RSA and DTL, we found that ns-polymorphism was 30x less common than s-polymorphism (...)

# R
scvs %>% filter(gene_callers_id==2602) %>% pull(pS_popular_consensus) %>% mean(na.rm=T) / scvs %>% filter(gene_callers_id==2602) %>% pull(pN_popular_consensus) %>% mean(na.rm=T)

(...) In comparison, s-polymorphism distributed relatively homogeneously throughout the protein, whereby 17% of s-polymorphism occurred within 10Å of active sites (compared to 3% for ns-polymorphism) and 19% occurred in sites with 0 RSA (compared to 9% for ns-polymorphism) (...)

# R
temp <- scvs %>% filter(gene_callers_id==2602)
total_pS <- temp %>% pull(pS_popular_consensus) %>% sum(na.rm=T)
small_DTL <- temp %>% filter(complex_DTL < 10) %>% pull(pN_popular_consensus) %>% sum(na.rm=T)
small_DTL/total_pS*100

(...) In comparison, s-polymorphism distributed relatively homogeneously throughout the protein, whereby 17% of s-polymorphism occurred within 10Å of active sites (compared to 3% for ns-polymorphism) and 19% occurred in sites with 0 RSA (compared to 9% for ns-polymorphism) (...)

# R
temp <- scvs %>% filter(gene_callers_id==2602)
total_pN <- temp %>% pull(pN_popular_consensus) %>% sum(na.rm=T)
small_DTL_pN <- temp %>% filter(complex_DTL < 10) %>% pull(pN_popular_consensus) %>% sum(na.rm=T)
small_DTL_pN/total_pN*100

(...) In comparison, s-polymorphism distributed relatively homogeneously throughout the protein, whereby 17% of s-polymorphism occurred within 10Å of active sites (compared to 3% for ns-polymorphism) and 19% occurred in sites with 0 RSA (compared to 9% for ns-polymorphism) (...)

# R
temp <- scvs %>% filter(gene_callers_id==2602)
total_pS <- temp %>% pull(pS_popular_consensus) %>% sum(na.rm=T)
small_RSA_pS <- temp %>% filter(complex_RSA == 0) %>% pull(pS_popular_consensus) %>% sum(na.rm=T)
small_RSA_pS/total_pS*100

(...) In comparison, s-polymorphism distributed relatively homogeneously throughout the protein, whereby 17% of s-polymorphism occurred within 10Å of active sites (compared to 3% for ns-polymorphism) and 19% occurred in sites with 0 RSA (compared to 9% for ns-polymorphism) (...)

# R
temp <- scvs %>% filter(gene_callers_id==2602)
total_pN <- temp %>% pull(pN_popular_consensus) %>% sum(na.rm=T)
small_RSA_pN <- temp %>% filter(complex_RSA == 0) %>% pull(pN_popular_consensus) %>% sum(na.rm=T)
small_RSA_pN/total_pN*100

(...) Averaged across samples, the mean RSA was 0.15 for s-polymorphism and 0.33 for ns-polymorphism (Figure 3e left panel) (...)

# R
temp <- scvs %>% filter(gene_callers_id==2602)
# mean RSA of pS
print(temp %>% group_by(sample_id) %>% mutate(pS_weighted_RSA = pS_popular_consensus/sum(pS_popular_consensus, na.rm=T)*complex_RSA) %>% summarize(mean_RSA = sum(pS_weighted_RSA, na.rm=T)) %>% pull(mean_RSA) %>% mean())
# mean RSA of pN
print(temp %>% group_by(sample_id) %>% mutate(pN_weighted_RSA = pN_popular_consensus/sum(pN_popular_consensus, na.rm=T)*complex_RSA) %>% summarize(mean_RSA = sum(pN_weighted_RSA, na.rm=T)) %>% pull(mean_RSA) %>% mean())

(...) Similarly, the mean DTL was 17.2Å for s-polymorphism and 22.9Å for ns-polymorphism (Figure 3f left panel) (...)

# R
temp <- scvs %>% filter(gene_callers_id == 2602)
# mean DTL of pS
print(temp %>% group_by(sample_id) %>% mutate(pN_weighted_DTL = pN_popular_consensus/sum(pN_popular_consensus, na.rm=T)*complex_DTL) %>% summarize(mean_DTL = sum(pN_weighted_DTL, na.rm=T)) %>% pull(mean_DTL) %>% mean())
# mean DTL of pN
print(temp %>% group_by(sample_id) %>% mutate(pN_weighted_DTL = pN_popular_consensus/sum(pN_popular_consensus, na.rm=T)*complex_DTL) %>% summarize(mean_DTL = sum(pN_weighted_DTL, na.rm=T)) %>% pull(mean_DTL) %>% mean())

(...) [T]he mean RSA of s-polymorphism remained relatively invariant (standard deviation 0.005) (Figure 3e right panel) (...)

# R
scvs %>% filter(gene_callers_id==2602) %>% group_by(sample_id) %>% mutate(pS_weighted_RSA = pS_popular_consensus/sum(pS_popular_consensus, na.rm=T)*complex_RSA) %>% summarize(mean_RSA = sum(pS_weighted_RSA, na.rm=T), pnps=mean(pnps, na.rm=T)) %>% pull(mean_RSA) %>% sd()

(...) [T]he mean RSA of ns-polymorphism varied dramatically from 0.27 to 0.37 and was profoundly influenced by sample pN/pS(GS) (...)

# R
print(scvs %>% filter(gene_callers_id==2602) %>% group_by(sample_id) %>% mutate(pN_weighted_RSA = pN_popular_consensus/sum(pN_popular_consensus, na.rm=T)*complex_RSA) %>% summarize(mean_RSA = sum(pN_weighted_RSA, na.rm=T)) %>% pull(mean_RSA) %>% min())
print(scvs %>% filter(gene_callers_id==2602) %>% group_by(sample_id) %>% mutate(pN_weighted_RSA = pN_popular_consensus/sum(pN_popular_consensus, na.rm=T)*complex_RSA) %>% summarize(mean_RSA = sum(pN_weighted_RSA, na.rm=T)) %>% pull(mean_RSA) %>% max())

(...) 82.9% of mean RSA ns-polymorphism variance could be explained by pN/pS(GS) alone (Pearson correlation, p-value < 1x10-16, R2 = 0.829) (...)

# R
scvs %>% filter(gene_callers_id==2602) %>% group_by(sample_id) %>% mutate(pN_weighted_RSA = pN_popular_consensus/sum(pN_popular_consensus, na.rm=T)*complex_RSA) %>% summarize(mean_RSA = sum(pN_weighted_RSA, na.rm=T), pnps=mean(pnps, na.rm=T)) %>% lm(mean_RSA~pnps, data=.) %>% summary()

(...) This correlation disappeared when the sites were shuffled (Pearson correlation, R2 = 0.014, standard error 0.006 from 10 trials) (...)

# R
N <- 10
pN_RSA_R2s <- c()
pS_RSA_R2s <- c()
gs <- scvs %>%
    filter(gene_callers_id == 2602)
for (i in 1:N) {
    sample_averaged <- gs %>%
        group_by(sample_id) %>%
        mutate(
            shuffled_RSA = sample(complex_RSA),
            pS_weighted_RSA = pS_popular_consensus/sum(pS_popular_consensus, na.rm=T)*complex_RSA,
            pS_weighted_RSA_shuff = pS_popular_consensus/sum(pS_popular_consensus, na.rm=T)*shuffled_RSA,
            pN_weighted_RSA = pN_popular_consensus/sum(pN_popular_consensus, na.rm=T)*complex_RSA,
            pN_weighted_RSA_shuff = pN_popular_consensus/sum(pN_popular_consensus, na.rm=T)*shuffled_RSA,
        ) %>%
        summarize(
            mean_pS_RSA=sum(pS_weighted_RSA, na.rm=T),
            mean_pS_RSA_shuff=sum(pS_weighted_RSA_shuff, na.rm=T),
            mean_pN_RSA=sum(pN_weighted_RSA, na.rm=T),
            mean_pN_RSA_shuff=sum(pN_weighted_RSA_shuff, na.rm=T),
            pnps = as.numeric(names(which.max(table(pnps))))
        )
    pN_RSA_R2s <- c(sample_averaged %>% lm(mean_pN_RSA_shuff~pnps, data=.) %>% summary() %>% .$r.squared, pN_RSA_R2s)
    pS_RSA_R2s <- c(sample_averaged %>% lm(mean_pS_RSA_shuff~pnps, data=.) %>% summary() %>% .$r.squared, pS_RSA_R2s)
}
pN_RSA_R2_mean <- mean(pN_RSA_R2s)
pN_RSA_R2_stderr <- sd(pN_RSA_R2s) / sqrt(N)
pS_RSA_R2_mean <- mean(pS_RSA_R2s)
pS_RSA_R2_stderr <- sd(pS_RSA_R2s) / sqrt(N)
print(paste("% pN(site) weighted RSA R2:", pN_RSA_R2_mean, "+-", pN_RSA_R2_stderr))

(...) ns-polymorphism distributions with respect to DTL were equally governed by selection strength, where 80.4% of variance could be explained by pN/pS(GS) (Pearson correlation, p-value 1x10-16, R2 = 0.804, Figure 3f) (...)

# R
scvs %>% filter(gene_callers_id==2602) %>% group_by(sample_id) %>% mutate(pN_weighted_DTL = pN_popular_consensus/sum(pN_popular_consensus, na.rm=T)*complex_DTL) %>% summarize(mean_DTL = sum(pN_weighted_DTL, na.rm=T), pnps=mean(pnps, na.rm=T)) %>% lm(mean_DTL~pnps, data=.) %>% summary() %>% .$r.squared

(...) This correlation disappeared when the sites were shuffled (Pearson correlation, R2 = 0.011, standard error 0.004 from 10 trials) (...)

# R
N <- 10
pN_DTL_R2s <- c()
pS_DTL_R2s <- c()
gs <- scvs %>%
    filter(gene_callers_id == 2602)
for (i in 1:N) {
    sample_averaged <- gs %>%
        group_by(sample_id) %>%
        mutate(
            shuffled_DTL = sample(complex_DTL),
            pS_weighted_DTL = pS_popular_consensus/sum(pS_popular_consensus, na.rm=T)*complex_DTL,
            pS_weighted_DTL_shuff = pS_popular_consensus/sum(pS_popular_consensus, na.rm=T)*shuffled_DTL,
            pN_weighted_DTL = pN_popular_consensus/sum(pN_popular_consensus, na.rm=T)*complex_DTL,
            pN_weighted_DTL_shuff = pN_popular_consensus/sum(pN_popular_consensus, na.rm=T)*shuffled_DTL,
        ) %>%
        summarize(
            mean_pS_DTL=sum(pS_weighted_DTL, na.rm=T),
            mean_pS_DTL_shuff=sum(pS_weighted_DTL_shuff, na.rm=T),
            mean_pN_DTL=sum(pN_weighted_DTL, na.rm=T),
            mean_pN_DTL_shuff=sum(pN_weighted_DTL_shuff, na.rm=T),
            pnps = as.numeric(names(which.max(table(pnps))))
        )
    pN_DTL_R2s <- c(sample_averaged %>% lm(mean_pN_DTL_shuff~pnps, data=.) %>% summary() %>% .$r.squared, pN_DTL_R2s)
    pS_DTL_R2s <- c(sample_averaged %>% lm(mean_pS_DTL_shuff~pnps, data=.) %>% summary() %>% .$r.squared, pS_DTL_R2s)
}
pN_DTL_R2_mean <- mean(pN_DTL_R2s)
pN_DTL_R2_stderr <- sd(pN_DTL_R2s) / sqrt(N)
pS_DTL_R2_mean <- mean(pS_DTL_R2s)
pS_DTL_R2_stderr <- sd(pS_DTL_R2s) / sqrt(N)
print(paste("% pN(site) weighted DTL R2:", pN_DTL_R2_mean, "+-", pN_DTL_R2_stderr))

(...) [Figure 3B:] The vertical green line depicts the sample-averaged pN/pS(gene) for GS (0.020) (...)

# R
pnps %>% group_by(gene_callers_id) %>% summarize(pnps=mean(pnps, na.rm=T)) %>% filter(gene_callers_id==2602) %>% pull(pnps)

(...) [Figure 3b:] The inset plot shows the distribution of pN/pS(gene) value for GS as seen across the 74 samples, which vary from 0.010 to 0.036 (...)

# R
print(pnps %>% filter(gene_callers_id==2602) %>% pull(pnps) %>% min())
print(pnps %>% filter(gene_callers_id==2602) %>% pull(pnps) %>% max())

(...) [Figure 3c:] The vertical green line depicts the correlation coefficient for GS (0.34) (...)

# R
env_corr %>% filter(gene_callers_id == 2602) %>% pull(cor_nitrates)

(...) We identified putative sites fitting this description by scoring sites based on the extent that their amino acid minor allele frequencies co-varied with pN/pS(GS), including only sites with DTL less than the mean DTL of ns-polymorphisms (22.9Å) (...)

# R
scvs %>% filter(gene_callers_id == 2602) %>% group_by(sample_id) %>% mutate(pN_weighted_DTL = pN_popular_consensus/sum(pN_popular_consensus, na.rm=T)*complex_DTL) %>% summarise(mean_pN_DTL=sum(pN_weighted_DTL, na.rm=T)) %>% pull(mean_pN_DTL) %>% mean()

(...) Though each of these sites exhibited DTL lower than the average ns-polymorphism, the closest site (residue number 323) was still 9Å away from the glutamate substrate (...)

# R
scvs %>% filter(gene_callers_id==2602) %>% filter(sample_id == 'ANE_004_05M', codon_order_in_gene %in% (c(96, 152, 175, 176, 230, 288, 323, 364, 379)-1)) %>% select(codon_order_in_gene, complex_DTL, complex_RSA) %>% mutate(codon_number = codon_order_in_gene + 1) %>% select(-codon_order_in_gene)

(...) After all, in absolute terms GS is highly purified regardless of sample – the largest pN/pS(GS) is 0.036, which is just over half the genome-wide average pN/pS(gene) of 0.063 (...)

# R
pnps %>% filter(gene_callers_id==2602) %>% pull(pnps) %>% max()

(...) After all, in absolute terms GS is highly purified regardless of sample – the largest pN/pS(GS) is 0.036, which is just over half the genome-wide average pN/pS(gene) of 0.063 (...)

# R
pnps %>% pull(pnps) %>% mean(na.rm=T)

(...) Though highly significant (one sided Pearson p-values 9x10-12 for RSA and 2x10-4 for DTL), the magnitude that ns-polymorphism distributions shift with respect to DTL and RSA were subtle (...)

# R
temp <- scvs %>%
filter(
!is.na(pN_popular_consensus),
ANY_dist <= 40
) %>%
group_by(sample_id) %>%
mutate(
pS_weighted_DTL = pS_popular_consensus/sum(pS_popular_consensus)*ANY_dist,
pN_weighted_DTL = pN_popular_consensus/sum(pN_popular_consensus)*ANY_dist,
pS_weighted_RSA = pS_popular_consensus/sum(pS_popular_consensus)*rel_solvent_acc,
pN_weighted_RSA = pN_popular_consensus/sum(pN_popular_consensus)*rel_solvent_acc,
rarity_weighted_DTL = codon_rarity/sum(codon_rarity)*ANY_dist,
rarity_weighted_RSA = codon_rarity/sum(codon_rarity)*rel_solvent_acc
) %>%
summarise(
mean_pS_DTL=sum(pS_weighted_DTL),
mean_pN_DTL=sum(pN_weighted_DTL),
mean_pS_RSA=sum(pS_weighted_RSA),
mean_pN_RSA=sum(pN_weighted_RSA),
mean_rarity_DTL=sum(rarity_weighted_DTL),
mean_rarity_RSA=sum(rarity_weighted_RSA),
pnps=mean(genome_pnps),
temperature=mean(temperature)
)
print(cor.test(temp$mean_pN_RSA, temp$pnps, alternative='less') %>% .$p.val)
print(cor.test(temp$mean_pN_DTL, temp$pnps, alternative='less') %>% .$p.val)

(...) [Figure 4:] (C) The s-polymorphism distribution mean with respect to RSA is negatively associated with pN/pS(core) (one-sided Pearson p-value = 1x10-5). (D) The s-polymorphism distribution mean with respect to RSA is negatively associated with pN/pS(core) (one-sided Pearson p-value = 3x10-7). (...)

# R
temp <- scvs %>%
filter(
!is.na(pN_popular_consensus),
ANY_dist <= 40
) %>%
group_by(sample_id) %>%
mutate(
pS_weighted_DTL = pS_popular_consensus/sum(pS_popular_consensus)*ANY_dist,
pN_weighted_DTL = pN_popular_consensus/sum(pN_popular_consensus)*ANY_dist,
pS_weighted_RSA = pS_popular_consensus/sum(pS_popular_consensus)*rel_solvent_acc,
pN_weighted_RSA = pN_popular_consensus/sum(pN_popular_consensus)*rel_solvent_acc,
rarity_weighted_DTL = codon_rarity/sum(codon_rarity)*ANY_dist,
rarity_weighted_RSA = codon_rarity/sum(codon_rarity)*rel_solvent_acc
) %>%
summarise(
mean_pS_DTL=sum(pS_weighted_DTL),
mean_pN_DTL=sum(pN_weighted_DTL),
mean_pS_RSA=sum(pS_weighted_RSA),
mean_pN_RSA=sum(pN_weighted_RSA),
mean_rarity_DTL=sum(rarity_weighted_DTL),
mean_rarity_RSA=sum(rarity_weighted_RSA),
pnps=mean(genome_pnps),
temperature=mean(temperature)
)
print(cor.test(temp$mean_pS_RSA, temp$pnps, alternative='less') %>% .$p.val)
print(cor.test(temp$mean_pS_DTL, temp$pnps, alternative='less') %>% .$p.val)

(...) [Figure 4:] (E) Rare synonymous codons are more abundant in samples with high pN/pS(core) (one-sided Pearson p-value = 4x10-5). (F) Rare synonymous codons avoid low RSA sites when pN/pS(core) is low (one-sided Pearson p-value = 1x10-10). (G) Rare synonymous codons avoid low DTL sites when pN/pS(core) is low (one-sided Pearson p-value = 7x10-9) (...)

# R

pN_cutoff <- 0.0005

# E
plot_data <- scvs %>%
filter(pN_popular_consensus < pN_cutoff) %>%
group_by(sample_id) %>%
summarise(
codon_rarity=mean(codon_rarity),
pnps=mean(genome_pnps)
)
print(cor.test(plot_data$codon_rarity, plot_data$pnps, alternative='greater') %>% .$p.val)

# F
plot_data <- scvs %>%
filter(
pN_popular_consensus < pN_cutoff,
!is.na(rel_solvent_acc) # Filter out any genes without structures
) %>%
group_by(sample_id) %>%
mutate(
rarity_weighted_RSA = codon_rarity/sum(codon_rarity)*rel_solvent_acc
) %>%
summarise(
mean_rarity_RSA=sum(rarity_weighted_RSA),
pnps=mean(genome_pnps)
)
print(cor.test(plot_data$mean_rarity_RSA, plot_data$pnps, alternative='less') %>% .$p.val)

# G
plot_data <- scvs %>%
filter(
pN_popular_consensus < pN_cutoff,
!is.na(ANY_dist) # Filter out any genes without structures or binding sites
) %>%
group_by(sample_id) %>%
mutate(
rarity_weighted_DTL = codon_rarity/sum(codon_rarity)*ANY_dist
) %>%
summarise(
mean_rarity_DTL=sum(rarity_weighted_DTL),
pnps=mean(genome_pnps)
)
print(cor.test(plot_data$mean_rarity_DTL, plot_data$pnps, alternative='less') %>% .$p.val)

Synonymous but not silent: selection against rare codons at critical sites

(...) With a GC-content lower than 30%, SAR11 genomes maintain a non-uniform yet conserved codon composition (Figure S13) (...)

# python
import anvio.utils as utils
seqs = utils.get_FASTA_file_as_dictionary('contigs.fa')
concated_seq = ''.join([s for s in seqs.values()])
print(utils.get_GC_content_for_sequence(seq))

All other numbers in this section are found in the code blocks pertaining to Figure 4 (above).

Methods

Predicting and processing protein structures

(...) To avoid this, we discarded any templates if the alignment coverage of the protein sequence to the template was <80%. Applying these filters resulted in 408 structures from the 1a.3.V core (...)

# python
import anvio.structureops as sops
sdb = sops.StructureDatabase('09_STRUCTURE_MOD.db')
goi = [int(x.strip()) for x in open('goi').readlines()]
len([x for x in sdb.genes_with_structure if x in goi])

(...) [The number of MODELLER-predicted structures ]was further refined by requiring that the root mean squared distance (RMSD) between the predicted structure and the most similar template did not exceed 7.5 Å, and that the GA341 model score exceeded 0.95. After applying these constraints, we were left with 348 structures in the 1a.3.V that we assumed to be ‘trustworthy’ structures as predicted by MODELLER (...)

# bash
wc -l 12_GENES_WITH_GOOD_STRUCTURES_MODELLER

(...) These structures were on average 44.8% identical to their templates, which is within the sequence similarity regime where template-based homology modeling generally produces the correct overall fold (Rost 1999) (...)

# R
structure_val <- read_tsv("../09_STRUCTURE_comparison.txt")
structure_val %>% filter(has_MOD, RMSD <= 7.5) %>% .$template_similarity %>% mean()

Predicting ligand-binding sites

(...) We then filtered out binding frequency scores less than 0.5, yielding 40,219 predicted ligand-residue interactions across 11,480 unique sites (Table S5) (...)

# R
temp <- read_tsv('../08_INTERACDOME-binding_frequencies.txt')
temp %>% dim() %>% .[[1]]

(...) We then filtered out binding frequency scores less than 0.5, yielding 40,219 predicted ligand-residue interactions across 11,480 unique sites (Table S5) (...)

# R
temp <- read_tsv('../08_INTERACDOME-binding_frequencies.txt')
temp %>% mutate(new = paste(gene_callers_id, codon_order_in_gene)) %>% pull(new) %>% unique() %>% length()

Calculating distance-to-ligand (DTL)

(...) To mitigate the influence of this inevitable error source, we conservatively excluded DTL values >40 Å (8.0% of sites) in all analyses after Figure 2b (...)

# R
ggs <- read_tsv("../12_GENES_WITH_GOOD_STRUCTURES", col_names=F) %>% pull(X1)
sar11_data <- scvs %>%
filter(
sample_id == 'ANE_004_05M', # pick random sample to massively subset data
gene_callers_id %in% ggs, # only genes with predicted structure
!is.na(ANY_dist) # only genes with predicted ligands
)
100 * sar11_data %>% filter(ANY_dist >= 40) %>% dim() %>% .[[1]] / sar11_data %>% dim() %>% .[[1]]

Proportion of polymorphism rate variance explained by RSA and DTL

(...) After excluding monomorphic sites (pN(site) = 0 for ns-models, pS(site) = 0 for s-models), this yielded 5,374,739 data points for s-models and 3,530,079 for ns-models (...)

# R
lm_rsa_gene_sample_ps <- readRDS("../lm_rsa_gene_sample_ps.RDS")
print(lm_rsa_gene_sample_ps$model %>% dim())
rm(lm_rsa_gene_sample_ps) # takes up lots of memory, remove when finished

lm_rsa_gene_sample_pn <- readRDS("../lm_rsa_gene_sample_pn.RDS")
print(lm_rsa_gene_sample_pn$model %>% dim())
rm(lm_rsa_gene_sample_pn) # takes up lots of memory, remove when finished

Supplementary Figures

(...) The two groups were defined as having TM scores above or below 0.8, where the >0.8 group corresponded to the 291 best alignments (left) and the <0.8 group corresponded to the 48 worst alignments (...)

# R
df <- read_tsv("../09_STRUCTURE_comparison.txt") %>% mutate(frac_ss_MOD = frac_alpha_MOD + frac_beta_MOD, frac_ss_AF = frac_alpha_AF + frac_beta_AF)
df %>% filter(has_MOD, has_AF) %>% mutate(cohort = ifelse(TM_score > 0.8, 'cohort1', 'cohort2')) %>% filter(cohort=='cohort1') %>% dim() %>% .[[1]]

(...) [Figure S4:] The two groups were defined as having TM scores above or below 0.8, where the >0.8 group corresponded to the 291 best alignments (left) and the <0.8 group corresponded to the 48 worst alignments (...)

# R
df <- read_tsv("../09_STRUCTURE_comparison.txt") %>% mutate(frac_ss_MOD = frac_alpha_MOD + frac_beta_MOD, frac_ss_AF = frac_alpha_AF + frac_beta_AF)
df %>% filter(has_MOD, has_AF) %>% mutate(cohort = ifelse(TM_score < 0.8, 'cohort1', 'cohort2')) %>% filter(cohort=='cohort1') %>% dim() %>% .[[1]]

(...) [Figure S14:] Using lysine as an example, this led to on average 21,127 sites per sample (...)

# R
tmp <- scvs %>% filter(pN_popular_consensus <= 0.0005, most_freq_aa == 'Lys')
tmp %>% group_by(sample_id) %>% summarise(count(.)) %>% mutate(x=1) %>% group_by(x) %>% summarise(min(n), max(n), median(n), mean(n), sd(n))

Supplementary Information

Regimes of sequence similarity probed by metagenomics, SAR11 cultured genomes, and protein families

(...) Unsurprisingly, protein families are most evolutionarily divergent (mean amino acid PS 28.8%). Relative to SAR11 homologs (mean nucleotide PS 77.3%), the aligned reads are highly related (mean nucleotide PS 94.5%), showing that metagenomics offers a modality of sequence inquiry more highly resolved than sequence comparisons between isolated cultures (...)

# R
args <- list()
args$reads <- "../18_PERCENT_ID.txt"
args$pfam <- "../18_PERCENT_ID_PFAM.txt"
args$pangenome <- "../18_PERCENT_ID_PANGENOME.txt"
reads <- read_tsv(args$reads) %>%
pivot_longer(cols=-gene_callers_id) %>%
group_by(gene_callers_id) %>%
summarise(reads = mean(value))
pfam <- read_tsv(args$pfam) %>%
rename(pfam = percent_id)
pan <- read_tsv(args$pangenome) %>%
rename(pan = percent_id)
df_wide <- reads %>%
left_join(pfam) %>%
left_join(pan)
df <- df_wide %>%
pivot_longer(cols=-gene_callers_id)
options(pillar.sigfig = 5)
print(df %>% group_by(name) %>% summarise(mean=mean(value, na.rm=TRUE), median=median(value, na.rm=TRUE)))

Comparing structure predictions between AlphaFold and MODELLER

(...) While AlphaFold produced 754 structures we deemed trustworthy (see Methods), MODELLER produced 346 due to its reliance on pre-existing template structures (...)

# bash
wc -l 12_GENES_WITH_GOOD_STRUCTURES
wc -l 12_GENES_WITH_GOOD_STRUCTURES_MODELLER

(...) In 339 cases both methods procured a structure prediction for a given protein sequence (...)

# python
goi_af = set([int(x.strip()) for x in open('12_GENES_WITH_GOOD_STRUCTURES').readlines()])
goi_mod = set([int(x.strip()) for x in open('12_GENES_WITH_GOOD_STRUCTURES_MODELLER').readlines()])
len(goi_af.intersection(goi_mod))

(...) Since a TM score of 0.5 indicates that proteins likely belong to the same fold family (Xu and Zhang 2010), our average TM score of 0.88 indicates strong overall agreement between AlphaFold and MODELLER (...)

# R
df <- read_tsv("../09_STRUCTURE_comparison.txt") %>% mutate(frac_ss_MOD = frac_alpha_MOD + frac_beta_MOD, frac_ss_AF = frac_alpha_AF + frac_beta_AF)
df$TM_score %>% mean(na.rm=T)

(...) In fact, for the worst alignments (TM score <0.6), in 15 of 16 cases AlphaFold yielded more secondary structure (...)

# R
df <- read_tsv("../09_STRUCTURE_comparison.txt") %>% mutate(frac_ss_MOD = frac_alpha_MOD + frac_beta_MOD, frac_ss_AF = frac_alpha_AF + frac_beta_AF)
print(df %>% filter(TM_score < 0.6, frac_ss_AF > frac_ss_MOD) %>% dim() %>% .[[1]])
print(df %>% filter(TM_score < 0.6) %>% dim() %>% .[[1]])

(...) Even still, these structures averaged a mean pLDDT score of 90.8, which is considered to be highly accurate (Jumper et al. 2021) (...)

# R
df <- read_tsv("../09_STRUCTURE_comparison.txt") %>% mutate(frac_ss_MOD = frac_alpha_MOD + frac_beta_MOD, frac_ss_AF = frac_alpha_AF + frac_beta_AF)
df %>% filter(has_AF, !has_MOD) %>% .$mean_pLDDT %>% mean()

RSA and DTL predict nonsynonymous polymorphism rates

(...) We excluded monomorphic sites (pN(site) = 0 for ns-models, pS(site) = 0 for s-models), sites with DTL > 40Å (see Methods), and removed gene-sample pairs containing <100 remaining sites, resulting in 16,285 ns-models and 24,553 s-models (Table S7) (...)

# R
print(pn_models %>% dim() %>% .[[1]])
print(ps_models %>% dim() %>% .[[1]])

(...) We filtered out any genes that did not have a predicted structure and at least one predicted ligand-binding site, which when applied in conjunction with the above filters resulted in 381 genes for the s-models and 342 genes for the ns-models (...)

# R
print(ps_models %>% pull(gene_callers_id) %>% unique() %>% length())
print(pn_models %>% pull(gene_callers_id) %>% unique() %>% length())

(...) ns-models yielded consistently positive correlations (average Pearson coefficient of rRSA = 0.353) (Figure 2c), whereas s-models exhibited correlations centered around 0 (average rRSA = -0.029) (...)

# R
print(poly_corr %>% .$pn_r_rsa %>% mean(na.rm=T))
print(poly_corr %>% .$ps_r_rsa %>% mean(na.rm=T))

(...) The average R2 was 0.137 for ns-models, however model quality varied significantly between gene-sample pairs (...)

# R
poly_corr %>% .$pn_rsq_rsa %>% mean(na.rm=T)

(...) In fact, we found that R2 varied from as high as 0.526 (gene 2264 in sample ION_42_80M), to as low as 0.0 (gene 2486 in sample ION_42_80M) (...)

# R
print(poly_corr %>% ungroup() %>% select(gene_callers_id, sample_id, pn_rsq_rsa) %>% slice(which.max(pn_rsq_rsa)))
print(poly_corr %>% ungroup() %>% select(gene_callers_id, sample_id, pn_rsq_rsa) %>% slice(which.min(pn_rsq_rsa)))

(...) Using the same procedure, we linearly regressed log10(pN(site)) and log10(pS(site)) with DTL and found that 96% of ns-models yielded positive correlations with DTL with considerable predictive power, where on average 11.5% of per-site ns-polymorphism rate variation could be explained by DTL (Table S7) (...)

# R
print(poly_corr %>% filter(pn_r_dist > 0) %>% dim() %>% .[[1]] / (poly_corr %>% filter(!is.na(pn_r_dist)) %>% dim %>% .[[1]]))
print(poly_corr %>% .$pn_rsq_dist %>% mean(na.rm=T))

(...) R2 values varied significantly, ranging from 0.514 (gene 2326 in sample PSE_100_05M) to 0.0 (gene 2246 in sample PSE_102_05M) (...)

# R
print(poly_corr %>% ungroup() %>% select(gene_callers_id, sample_id, pn_rsq_dist) %>% slice(which.max(pn_rsq_dist)))
print(poly_corr %>% ungroup() %>% select(gene_callers_id, sample_id, pn_rsq_dist) %>% slice(which.min(pn_rsq_dist)))

(...) Interestingly, we found that log10(pS(site)) on average negatively correlates with DTL (average Pearson coefficient -0.057) (...)

# R
poly_corr %>% pull(ps_r_dist) %>% mean()

(...) A Pearson correlation between RSA and DTL revealed the relative independence of each variable from the other (R2 = 0.082, r = 0.286), precluding effects of multicollinearity (Figure SI1) (...)

# R
rsa_vs_d_data <- scvs %>%
filter(sample_id == 'ANE_004_05M') %>%
select(gene_callers_id, codon_order_in_gene, rel_solvent_acc, ANY_dist) %>%
filter((!is.na(ANY_dist)))
print(summary(lm(rel_solvent_acc ~ ANY_dist, data=rsa_vs_d_data))$r.squared)
print(summary(lm(rel_solvent_acc ~ ANY_dist, data=rsa_vs_d_data))$r.squared %>% sqrt())

(...) The results revealed that including both RSA and DTL yielded a considerably better set of models for ns-polymorphism rates, with an average explained variance of 17.7% (average adjusted R2RSA-DTL = 0.177) (...)

# R
poly_corr %>% .$pn_adj_rsq %>% mean(na.rm=T)

(...) For example, the group (RSA1, DTL2) contains the 3,164 sites with RSA values in the 1st RSA range [0.00,0.01) and DTL values in the 2nd DTL range [5.0Å,6.4Å) (...)

# R
n_samples <- read_tsv("../soi", col_names=F) %>% nrow()
print(counts_group %>% filter(RSA_group == 0, DTL_group == 1) %>% .$count/n_samples)
print(pn_group %>% filter(RSA_group == 0, DTL_group == 1) %>% select(RSA_range))
print(pn_group %>% filter(RSA_group == 0, DTL_group == 1) %>% select(DTL_range))

(...) Nonsynonymous polymorphism rates of groups varied from as low as 0.001 to as high as 0.021 (...)

# R
print(pn_group %>% .$pn %>% min())
print(pn_group %>% .$pn %>% max())

(...) Sites exhibited a spectrum of ns-polymorphism rates that is roughly linear. We determined this by fitting a linear model pN(group) ~ i + j, where i refers to the group’s RSA and DTL indices (RSAi, DTLj), yielded an adjusted R2 of 0.836 (...)

# R
summary(pn_group_model)$adj.r.squared

(...) [T]he linear model pS(group) ~ i + j yielded a significant, anti-correlated relationship with both RSA and DTL (adjusted R2 of 0.206) (...)

# R
summary(ps_group_model)$adj.r.squared

(...) in the sample-gene models, (1) the mean Pearson correlation coefficient between pS(site) and RSA is -0.013 (Figure 2c), and (2) the mean Pearson correlation coefficient between pS(site) and DTL is -0.052 (Figure 2d) (...)

# R
print(poly_corr %>% .$ps_r_rsa %>% mean(na.rm=T))
print(poly_corr %>% .$ps_r_dist %>% mean(na.rm=T))

(...) [Figure SI1:] Scatter plot of RSA vs. DTL for the 143,181 sites belonging to genes with a predicted structure and at least one predicted ligand (...)

# R
rsa_vs_d_data <- scvs %>%
filter(sample_id == 'ANE_004_05M') %>%
select(gene_callers_id, codon_order_in_gene, rel_solvent_acc, ANY_dist) %>%
filter(ANY_dist < 40)
print(rsa_vs_d_data %>% dim() %>% .[[1]])

(...) The line of best fit is shown in black, The Pearson coefficient is 0.313 and the R2 is 0.098 (...)

# R
rsa_vs_d_data <- scvs %>%
filter(sample_id == 'ANE_004_05M') %>%
select(gene_callers_id, codon_order_in_gene, rel_solvent_acc, ANY_dist) %>%
filter(ANY_dist < 40)
model_r2 <- summary(lm(rel_solvent_acc ~ ANY_dist, data=rsa_vs_d_data))$r.squared
model_corr <- summary(lm(rel_solvent_acc ~ ANY_dist, data=rsa_vs_d_data))$r.squared %>% sqrt()
print(paste("R2 =", round(model_r2, 3), ", r =", round(model_corr, 3)))

(...) [Figure SI2:] Red denotes parameter/error distributions for the 16,285 nonsynonymous models of the form pN(site) $= \beta_0 + \beta_{RSA}RSA + \beta_{DTL}DTL$ and blue denotes parameter/error distributions for the 24,553 models of the form pS(site) $= \beta_0 + \beta_{RSA}RSA + \beta_{DTL}DTL (...)

# R
print(pn_models %>% dim() %>% .[[1]])
print(ps_models %>% dim() %>% .[[1]])

dN/dS(gene) and sample-averaged pN/pS(gene) yield consistent results

(...) We calculated dN/dS(gene) for 753 homologous gene pairs found between HIMB83 and a closely related cultured representative HIMB122 (see Methods) (...)

# R
pnps_sample_averaged <- pnps %>%
group_by(gene_callers_id) %>%
summarize(pnps = mean(pnps, na.rm=T))
dnds <- read_tsv("../19_DNDS_HIMB122/dnds.txt", col_names=F) %>%
rename(gene_callers_id=X1, dnds=X2) %>%
filter(!is.na(dnds))
df <- full_join(pnps_sample_averaged, dnds) %>%
mutate(log_pnps = log10(pnps), log_dnds = log10(dnds))
overlap <- df %>% filter(!is.na(pnps), !is.na(dnds))
overlap %>% dim() %>% .[[1]]

(...) ANI between HIMB83 and HIMB122 was 82.6%, whereas the average ANI between HIMB83 and recruited reads was 94.5% (...)

# python
import pandas as pd
df = pd.read_csv("07_ANI_HIMB122/ANIb_percentage_identity.txt", sep='\t').set_index('key')
df['HIMB083'].sort_values()

(...) ANI between HIMB83 and HIMB122 was 82.6%, whereas the average ANI between HIMB83 and recruited reads was 94.5% (...)

# R
args <- list()
args$reads <- "../18_PERCENT_ID.txt"
args$pfam <- "../18_PERCENT_ID_PFAM.txt"
args$pangenome <- "../18_PERCENT_ID_PANGENOME.txt"
reads <- read_tsv(args$reads) %>%
pivot_longer(cols=-gene_callers_id) %>%
group_by(gene_callers_id) %>%
summarise(reads = mean(value))
pfam <- read_tsv(args$pfam) %>%
rename(pfam = percent_id)
pan <- read_tsv(args$pangenome) %>%
rename(pan = percent_id)
df_wide <- reads %>%
left_join(pfam) %>%
left_join(pan)
df <- df_wide %>%
pivot_longer(cols=-gene_callers_id)
options(pillar.sigfig = 5)
print(df %>% group_by(name) %>% summarise(mean=mean(value, na.rm=TRUE), median=median(value, na.rm=TRUE)))

(...) We found that log-transformed sample-averaged pN/pS(gene) highly correlated with log-transformed dN/dS(gene) (Pearson R2 = 0.380) (...)

# R
pnps_sample_averaged <- pnps %>%
group_by(gene_callers_id) %>%
summarize(pnps = mean(pnps, na.rm=T))
dnds <- read_tsv("../19_DNDS_HIMB122/dnds.txt", col_names=F) %>%
rename(gene_callers_id=X1, dnds=X2) %>%
filter(!is.na(dnds))
df <- full_join(pnps_sample_averaged, dnds) %>%
mutate(log_pnps = log10(pnps), log_dnds = log10(dnds))
overlap <- df %>% filter(!is.na(pnps), !is.na(dnds))
overlap %>% lm(pnps ~ dnds, data=.) %>% summary() %>% .$r.squared

(...) The ratio between sample-averaged pN/pS(gene) and dN/dS(gene) was on average 6.23 (Figure S10), matching expectations that slightly deleterious, nonsynonymous mutants commonly drift to observable frequencies, yet far less commonly drift to fixation (...)

# R
pnps_sample_averaged <- pnps %>%
group_by(gene_callers_id) %>%
summarize(pnps = mean(pnps, na.rm=T))
dnds <- read_tsv("../19_DNDS_HIMB122/dnds.txt", col_names=F) %>%
rename(gene_callers_id=X1, dnds=X2) %>%
filter(!is.na(dnds))
df <- full_join(pnps_sample_averaged, dnds) %>%
mutate(log_pnps = log10(pnps), log_dnds = log10(dnds))
overlap <- df %>% filter(!is.na(pnps), !is.na(dnds))
overlap %>% filter(dnds > 0) %>% mutate(ratio = pnps/dnds) %>% pull(ratio) %>% mean()

Transcript abundance largely explains genic differences in the strengths of purifying selection

(...) Sample-averaged pN/pS(gene) values varied significantly between genes, varying from 0.004-0.539, with a mean of 0.063 (Figure S12, Table S9) (...)

# R
print(pnps %>% group_by(gene_callers_id) %>% summarise(pnps=mean(pnps)) %>% .$pnps %>% min(na.rm=T))
print(pnps %>% group_by(gene_callers_id) %>% summarise(pnps=mean(pnps)) %>% .$pnps %>% max(na.rm=T))
print(pnps %>% group_by(gene_callers_id) %>% summarise(pnps=mean(pnps)) %>% .$pnps %>% mean(na.rm=T))

(...) Comparing sample-median TA values to sample-averaged pN/pS(gene) values yielded a strong, negative correlation (Figure SI5b, Pearson r = -0.539, R2 = 0.290) according to an inverse power-law relationship (...)

# R
print(cor(TA_sample_averaged$log_pnps_mean, TA_sample_averaged$log_TA_median, use="pairwise.complete.obs"))
print(cor(TA_sample_averaged$log_pnps_mean, TA_sample_averaged$log_TA_median, use="pairwise.complete.obs") ^ 2)

(...) We found that of the 799 genes tested, 74% exhibited (weak) negative correlations between $log_{10}(TA+0.01)$ and $log_{10}(pN/pS(gene))$ (Figure SI5c), yet only 11.5% of genes passed significance tests (one-sided Pearson, 25% Benjamini-Hochberg false discovery rate) (Figure SI5d) (...)

# R
TA_sample_corr %>% pull(negative) %>% table() %>% .[[2]] / TA_sample_corr %>% dim() %>% .[[1]]

(...) We found that of the 799 genes tested, 74% exhibited (weak) negative correlations between $log_{10}(TA+0.01)$ and $log_{10}(pN/pS(gene))$ (Figure SI5c), yet only 11.5% of genes passed significance tests (one-sided Pearson, 25% Benjamini-Hochberg false discovery rate) (Figure SI5d) (...)

# R
TA_sample_corr %>% filter(fdr_25 == TRUE) %>% nrow() / (TA_sample_corr %>% filter(fdr_25 == FALSE) %>% nrow() + TA_sample_corr %>% filter(fdr_25 == TRUE) %>% nrow()) * 100

(...) [Figure SI5:] The linear model yielded a Pearson coefficient of -0.539, an R2 of 0.290 and a line of best fit y = (-0.31 士 0.02)x + (-1.63 士 0.02) shown in pink (95% confidence intervals shown in translucent pink) (...)

# R
lm(log_pnps_mean ~ log_TA_median, data=TA_sample_averaged) %>% summary() %>% .$coefficients

Stability analysis of polymorphism distributions with respect to pN/pS(core)

(...) Our bootstrapping stability analysis (Figure SI6, Table S14) showed that in 99.5% of gene resamplings, the mean RSA of ns-polymorphism negatively associated with pN/pS(core) (one-sided Pearson coefficient p-value <0.05), whereas in only 69.5% of gene resamplings did the mean DTL of ns-polymorphism negatively associate with pN/pS(core) (...)

# R
plot_data <- read_tsv('../genome_robust.txt')
print((plot_data %>% pull(pN_RSA_pval) <= 0.05) %>% sum() / plot_data %>% dim() %>% .[[1]])
print((plot_data %>% pull(pN_DTL_pval) <= 0.05) %>% sum() / plot_data %>% dim() %>% .[[1]])

Chapter V - Reproducing Kiefl et al, 2022

Quick Navigation

Preface

Results and Discussion

Intro

Polymorphism rates reveal intense purification of nonsynonymous mutants

Nonsynonymous polymorphism avoids buried sites

Nonsynonymous polymorphism avoids active sites

Proteomic trends in purifying selection are explained by RSA and DTL

Measuring purifying selection between genes and environments with pN/pS(gene)

Nitrogen availability governs rates of non-ideal polymorphism at critical sites of glutamine synthetase

Synonymous but not silent: selection against rare codons at critical sites

Methods

Predicting and processing protein structures

Predicting ligand-binding sites

Calculating distance-to-ligand (DTL)

Proportion of polymorphism rate variance explained by RSA and DTL

Supplementary Figures

Supplementary Information

Regimes of sequence similarity probed by metagenomics, SAR11 cultured genomes, and protein families

Comparing structure predictions between AlphaFold and MODELLER

RSA and DTL predict nonsynonymous polymorphism rates

dN/dS(gene) and sample-averaged pN/pS(gene) yield consistent results

Transcript abundance largely explains genic differences in the strengths of purifying selection

Stability analysis of polymorphism distributions with respect to pN/pS(core)