In this study, there were 88 soil samples collected from North and South America. Each of these soil samples had a pH measurement. From this paper, it was observed that pH plays a primary role in community assemblage. Here, we will show that pH by itself can actually predict the microbial abundances, using balance trees.

In [1]:
import pandas as pd

In [2]:
!qiime tools import \
--input-path 238_otu_table.biom \
--output-path 88soils.biom.qza \
--type FeatureTable[Frequency]

!qiime tools import \
--input-path 88soils_taxonomy.txt \
--output-path 88soils_taxonomy.qza \
--type FeatureData[Taxonomy]


Now we will be filtering out OTUs that are lower abundance. We'll set the threshold to be 100 reads. This is because there is going to be a lot of garbage OTUs due to contamination, sequencing error, or clustering errors. We feel that the 100 read filter is conservative enough to demonstrate the utility of this tool. Of course, this threshold depends on the dataset, and we suggest tinkering around with this threshold.

Samples and OTUS with fewer than 100 reads can be filtered as follows.

In [3]:
!qiime feature-table filter-features \
--i-table 88soils.biom.qza \
--o-filtered-table 88soils_filt100.biom.qza \
--p-min-frequency 100

Saved FeatureTable[Frequency] to: 88soils_filt100.biom.qza


Now we will build a tree based on the pH values, since in previous work it was observed that pH was the major driving factor behind the community dissimilarity.

In [4]:
!qiime gneiss gradient-clustering \
--i-table 88soils_filt100.biom.qza \
--o-clustering ph_tree.nwk.qza \
--p-no-weighted

Saved Hierarchy to: ph_tree.nwk.qza


The tree will give us information about how the OTUs partition. In this case, the tree groups together organisms that live in similar pH environments. To get a better idea about how these organisms, let's visualize these partitions on a heatmap. To this let's first categorize the samples by their measured pH.

Before running the regression, we have to account for zero abundances. Due the nature of zeros, we cannot be certain if the zeros arose from undersampling, or the complete absence of an OTU. To this extent, we'll add a pseudocount of 1 to approximate the uncertainity probability. We'll also want this for visualizing the heatmaps, since we'll be doing some log scaling.

In [5]:
!qiime composition add-pseudocount \
--i-table 88soils_filt100.biom.qza \
--p-pseudocount 1 \
--o-composition-table 88soils_composition.biom.qza

Saved FeatureTable[Composition] to: 88soils_composition.biom.qza


Now we can generate a heatmap using these discretized pH values.

In [6]:
!qiime gneiss dendrogram-heatmap \
--i-table 88soils_composition.biom.qza \
--i-tree ph_tree.nwk.qza \
--o-visualization "ph_heatmap" \
--p-ndim 10 --verbose

Saved Visualization to: ph_heatmap.qzv


Based on this, we can observe that there are clear cut clustering patterns of organisms with respect to pH.
We would like to run statistical tests that can determine how strong the correlation between the partitions and the pH is. Thats where balances can help us. To do this, we will calculate balances and perform an ordinary least squares regression on the transformed data.

We'll want to fit a quartic linear model to each of the balances individually with respect to pH. This model was chosen, because in the original paper, there was a horseshoe shape observed. So, it would make sense to use a parabolic function to fit each balance. Empirically, we found that a 4th degree polynomial gave the best results.

For now, we'll define 3 more variables encoding different powers of pH as follows.

Now we can run the linear regression using ordinary least squares regression. All of the covariates (i.e. ph, ph2, ph3, ph4) can be encoded in the regression model. Note that there is no dependent variable defined, since this regression is run individually against all of the balances specified by otu_table and ph_tree.

In [7]:
!qiime gneiss ilr-transform \
--i-table 88soils_composition.biom.qza \
--i-tree ph_tree.nwk.qza \
--o-balances 88soils_balances.qza

Saved FeatureTable[Balance] to: 88soils_balances.qza

In [8]:
!qiime gneiss ols-regression \
--p-formula "ph + ph2 + ph3 + ph4" \
--i-table 88soils_balances.qza \
--i-tree ph_tree.nwk.qza \
--o-visualization 88soils_regression_summary.qzv

Saved Visualization to: 88soils_regression_summary.qzv

In [9]:
!qiime gneiss balance-taxonomy \
--i-balances 88soils_balances.qza \
--i-tree ph_tree.nwk.qza \
--i-taxonomy 88soils_taxonomy.qza \
--p-taxa-level 2 \
--p-balance-name 'y0' \

Saved Visualization to: y0_taxa_summary.qzv