SurvBal_Vignette

Overview

SurvBal enables the selection of microbiome balances in relation to censored survival or time-to-event outcomes, which are of considerable interest in many biomedical studies. The most commonly used survival models – the Cox proportional hazards and parametric survival (including accelerated failure time) models are included in the package, which are used in combination with step-wise selection procedures to identify the optimal associated balance of microbiome, i.e., the ratio of the geometric means of two groups of taxa’s relative abundances.

Installation

Please make sure you install the R package zCompositions with version number less than or equal to 1.4.0.1 before installing SurvBal:

require(remotes) 
install_version("zCompositions", version = "1.4.0.1", repos = "http://cran.us.r-project.org")

Then you can install SurvBal from GitHub by:

# install.packages("devtools")
devtools::install_github("yinglia/SurvBal")

Please skip the recommended update for zCompositions.

Getting Started

#loading the library
library(SurvBal)

You can load the sample data in SurvBal using the following code.

#load example data
data("bacteria")
data("gvhd")

bacteria is a raw taxon count table with 63 samples and 139 taxa. gvhd is a survival object containing time and censoring/event information for each sample.

Examples

Select balance based on the Cox proportional hazards model

We can select the balance of microbiome based on the Cox proportional hazards Model by setting model = "coxph".

Here, we specify mult_repl = TRUE to use the geometric Bayesian multiplicative replacement method to impute the inflated zeros. If you would like to process the data by adding a small pseudo-count (0.5) to all raw counts instead, set mult_repl = FALSE.

We specify selection_criterion = "min_decrement_pvalue" so that the decrement of p-value along the forward selection path will be calculated, the balance before the first decrement that is smaller than selection_threshold (the default value is 0.15) will be selected. If “min_pvalue” is used, the balance that has the smallest p-value along the forward selection path will be selected. The default is “min_decrement_pvalue”.

select_coxph = balance_selection(
  Surv_obj = gvhd,
  data = bacteria,
  mult_repl = TRUE,
  model = "coxph",
  selection_criterion = "min_decrement_pvalue"
)

We can check the omnibus p-value from the global community-level association test, MiRKAT-S, which examines whether there is an overall shift in the microbiome composition (presence-absence status and abundance, encoded by Bray-Curtis and Jaccard distances) regarding the survival outcome.

select_coxph$global_p
#> [1] 0.5105105

A p-value greater than 0.05 indicates that there is no significant community-level association between the microbiome and survival outcome. Though it is not obvious whether the community-level test truly informs the validity of the balance – community-level testing is best used when there are concerted differences among a large number of taxa while the analysis of balances is about variable selection wherein the outcome is related to imbalances of a few taxa, it is recommended to interpret the final balance by SurvBal with caution, as the current algorithm may still select a balance when no associations exist.

We can find the selection path in selection_path:

select_coxph$selection_path[,1:2]
#>                                                        taxon_id sign
#> Dysosmobacter                                                33    0
#> Pseudoflavonifractor capillosus/Clostridium phoceensis       38    1
#> Clostridiales Family XIII. Incertae Sedis                    90    0
#> Anaerotignum lactatifermentans                               95    1
#> Lachnospira eligens                                         127    0
#> Eisenbergiella massiliensis                                 124    1
#> Massilimicrobiota timonensis                                 81    0
#> Acidaminococcus intestini                                    43    1
#> Agathobaculum butyriciproducens                             100    0
#> [Clostridium] leptum                                         26    0
#> Bifidobacterium catenulatum/pseudocatenulatum                55    0
#> Parasutterella excrementihominis                             48    1
#> Anaerobutyricum soehngenii                                  129    0
#> Bacteroides xylanisolvens                                    67    0
#> Eubacterium                                                  25    0
#> Anaerostipes                                                134    0
#> Alistipes putredinis                                         58    0
#> Eubacterium ventriosum                                       12    0
#> Bilophila wadsworthia                                        45    0

We can print the final fitted Cox PH model for the survival outcome, gvhd , with the selected balance:

select_coxph$survival_model
#> Call:
#> coxph(formula = Surv_obj ~ ., data = df)
#> 
#>           coef exp(coef) se(coef)     z       p
#> balance 1.3747    3.9537   0.1806 7.612 2.7e-14
#> 
#> Likelihood ratio test=55.99  on 1 df, p=7.275e-14
#> n= 63, number of events= 47
summary(select_coxph$survival_model)
#> Call:
#> coxph(formula = Surv_obj ~ ., data = df)
#> 
#>   n= 63, number of events= 47 
#> 
#>           coef exp(coef) se(coef)     z Pr(>|z|)    
#> balance 1.3747    3.9537   0.1806 7.612  2.7e-14 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#>         exp(coef) exp(-coef) lower .95 upper .95
#> balance     3.954     0.2529     2.775     5.633
#> 
#> Concordance= 0.824  (se = 0.031 )
#> Likelihood ratio test= 55.99  on 1 df,   p=7e-14
#> Wald test            = 57.94  on 1 df,   p=3e-14
#> Score (logrank) test = 68.07  on 1 df,   p=<2e-16

We can check which taxa are included in the numerator of the selected balance, as well as taxa in the denominator of the selected balance:

select_coxph$balance_name$numerator
#> [1] "Pseudoflavonifractor capillosus/Clostridium phoceensis"
#> [2] "Anaerotignum lactatifermentans"                        
#> [3] "Eisenbergiella massiliensis"                           
#> [4] "Acidaminococcus intestini"                             
#> [5] "Parasutterella excrementihominis"
select_coxph$balance_name$denominator
#>  [1] "Dysosmobacter"                                
#>  [2] "Clostridiales Family XIII. Incertae Sedis"    
#>  [3] "Lachnospira eligens"                          
#>  [4] "Massilimicrobiota timonensis"                 
#>  [5] "Agathobaculum butyriciproducens"              
#>  [6] "[Clostridium] leptum"                         
#>  [7] "Bifidobacterium catenulatum/pseudocatenulatum"
#>  [8] "Anaerobutyricum soehngenii"                   
#>  [9] "Bacteroides xylanisolvens"                    
#> [10] "Eubacterium"                                  
#> [11] "Anaerostipes"                                 
#> [12] "Alistipes putredinis"                         
#> [13] "Eubacterium ventriosum"                       
#> [14] "Bilophila wadsworthia"

And here is the value of the final selected balance of microbiome for each sample:

select_coxph$balance
#>       M-2000       M-2001       M-2002       M-2003       M-2004       M-2005 
#>  0.837486253 -1.406619798 -1.421036705 -0.204348362 -1.228283054 -0.005997298 
#>       M-2006       M-2007       M-2008       M-2009       M-2010       M-2013 
#>  1.779247033 -0.806398850  0.357369533  3.752290061  0.551966669  0.292634946 
#>       M-2014       M-2018       M-2021       M-2024       M-2030       M-2031 
#> -1.228931554 -1.239757883  1.684575012 -0.042165288 -0.679476501 -0.564948424 
#>       M-2035       M-2040       M-2041       M-2043       M-2045       M-2046 
#>  0.393028222 -0.414389042 -0.884891857  0.352006472  0.044044368  0.850689152 
#>       M-2049       M-2050       M-2051       M-2054       M-2056       M-2059 
#>  0.792721982 -0.434479547  0.104701085 -1.269892983 -0.326736863 -0.368146020 
#>       M-2062       M-2064       M-2066       M-2067       M-2068       M-2069 
#>  0.715060166 -0.646442335  0.191908746 -1.174924149  1.220369196  0.676698660 
#>       M-2074       M-2077       M-2079       M-2082       M-2084       M-2085 
#> -0.248523821 -0.221243932 -0.732856713 -0.402435682 -0.362810395  0.165101075 
#>       M-2087       M-2090       M-2091       M-2092       M-2094       M-2095 
#> -0.073229491  0.044846394  1.492174624 -1.665548697 -1.025799358 -0.179534268 
#>       M-2096       M-2098       M-2099       M-2101       M-2108       M-2109 
#> -0.818456580  2.142694112 -0.430141030  1.974103121 -0.609331350  0.921126209 
#>       M-2111       M-2115       M-2122       M-2123       M-2124       M-2125 
#>  0.600699874  0.183049924 -0.749427856  0.550366098 -0.667610499 -1.546930652 
#>       M-2128       M-2168       M-2169 
#>  0.587784622 -0.647886850 -0.233900172

Finally, we can visualize the survival plot with respect to several quantiles of the final selected balance of microbiome with 1-alpha confidence interval. The quantiles of the final selected balance for plot are specified through the argument quantile_plotted, of which the default value is c(0.25, 0.75). The default value of alpha is 0.05.

select_coxph$survival_plot

From the plot, we can see that the 25th percentile (lower quartile) of the sample microbiome balance is -0.6735, and the 75th percentile (upper quartile) is 0.5512. The survival curves for the two corresponding values of the microbiome balance are shown in the plot.

Select balance based on the parametric survival model

We can also select the balance based on the parametric survival model by setting model = "parametric" , and specify the distribution via the argument dist.

select_weibull = balance_selection(
  Surv_obj = gvhd,
  data = bacteria,
  mult_repl = FALSE,
  model = "parametric",
  dist = "weibull",
  selection_criterion = "min_decrement_pvalue"
)

Similarly, We can check the omnibus p-value from the global community-level association test in global_p:

select_coxph$global_p
#> [1] 0.5105105

we can find the selection path in selection_path:

select_weibull$selection_path[,1:2]
#>                                           taxon_id sign
#> [Ruminococcus] faecis                            8    1
#> Dorea                                          102    0
#> Clostridiales Family XIII. Incertae Sedis       90    1
#> Anaerotignum lactatifermentans                  95    0
#> Coprococcus                                     14    1
#> Acidaminococcus intestini                       43    0
#> Neglecta                                        29    0
#> Anaerobutyricum soehngenii                     129    1
#> Intestinimonas v2                               36    0
#> Massilimicrobiota timonensis                    81    1
#> Longicatena caecimuris                          75    0
#> Anaerobutyricum hallii                         131    0
#> Coprococcus comes                              112    1
#> Dysosmobacter                                   33    1
#> Agathobaculum butyriciproducens                100    1

We can print the final fitted parametric survival model for the survival outcome, gvhd, with the selected balance:

select_weibull$survival_model
#> Call:
#> survreg(formula = Surv_obj ~ ., data = df, dist = dist)
#> 
#> Coefficients:
#> (Intercept)     balance 
#>    5.769075    1.650078 
#> 
#> Scale= 1.199659 
#> 
#> Loglik(model)= -258.6   Loglik(intercept only)= -295.3
#>  Chisq= 73.41 on 1 degrees of freedom, p= <2e-16 
#> n= 63

We can check which taxa are included in the numerator of the selected balance, as well as taxa in the denominator of the selected balance:

select_weibull$balance_name
#> $numerator
#> [1] "[Ruminococcus] faecis"                    
#> [2] "Clostridiales Family XIII. Incertae Sedis"
#> [3] "Coprococcus"                              
#> [4] "Anaerobutyricum soehngenii"               
#> [5] "Massilimicrobiota timonensis"             
#> [6] "Coprococcus comes"                        
#> [7] "Dysosmobacter"                            
#> [8] "Agathobaculum butyriciproducens"          
#> 
#> $denominator
#> [1] "Dorea"                          "Anaerotignum lactatifermentans"
#> [3] "Acidaminococcus intestini"      "Neglecta"                      
#> [5] "Intestinimonas v2"              "Longicatena caecimuris"        
#> [7] "Anaerobutyricum hallii"

And here is the value of the final selected balance of microbiome for each sample:

select_weibull$balance
#>      M-2000      M-2001      M-2002      M-2003      M-2004      M-2005 
#>  0.00000000  2.04057566 -0.01319373 -0.75699431  0.93550620  1.35663561 
#>      M-2006      M-2007      M-2008      M-2009      M-2010      M-2013 
#> -0.47998351  0.65783627 -1.05440428 -3.55409033 -1.73662766 -0.70293243 
#>      M-2014      M-2018      M-2021      M-2024      M-2030      M-2031 
#>  1.15979605  1.34995970 -0.86751417 -1.16936496  2.08546971 -0.62975327 
#>      M-2035      M-2040      M-2041      M-2043      M-2045      M-2046 
#> -0.72669106 -1.60148802  1.16110388  0.25503500 -1.32013808 -0.93820266 
#>      M-2049      M-2050      M-2051      M-2054      M-2056      M-2059 
#> -0.14414414 -0.17656269 -1.26209604  1.16819286 -1.25011071  0.41197961 
#>      M-2062      M-2064      M-2066      M-2067      M-2068      M-2069 
#> -1.38818211 -0.04957122 -1.20049269 -0.59210268 -1.56215146 -1.02600843 
#>      M-2074      M-2077      M-2079      M-2082      M-2084      M-2085 
#> -0.31388923  0.27465307  1.09060275 -1.27005905  0.11770846 -0.61724786 
#>      M-2087      M-2090      M-2091      M-2092      M-2094      M-2095 
#> -0.06916086 -1.72846110 -1.79710092  1.03588867  1.24154467 -0.43826587 
#>      M-2096      M-2098      M-2099      M-2101      M-2108      M-2109 
#> -0.22965446 -0.95661076 -0.17400633 -1.52564103 -0.40173157 -0.95808157 
#>      M-2111      M-2115      M-2122      M-2123      M-2124      M-2125 
#> -1.87655712  0.05168391  0.11770846 -1.57673097  0.07865761  1.31621642 
#>      M-2128      M-2168      M-2169 
#> -0.80668017 -1.71446159  0.27465307

Finally, we can visualize the survival plot:

select_weibull$survival_plot