In this study, there were 88 soil samples collected from North and South America. Each of these soil samples had a pH measurement. From this paper, it was observed that pH plays a primary role in community assemblage. Here, we will show that pH by itself can actually predict the microbial abundances, using balance trees.

In [1]:
import pandas as pd
In [2]:
!qiime tools import \
    --input-path 238_otu_table.biom \
    --output-path 88soils.biom.qza \
    --type FeatureTable[Frequency]

!qiime tools import \
    --input-path 88soils_taxonomy.txt \
    --output-path 88soils_taxonomy.qza \
    --type FeatureData[Taxonomy]

Now we will be filtering out OTUs that are lower abundance. We'll set the threshold to be 100 reads. This is because there is going to be a lot of garbage OTUs due to contamination, sequencing error, or clustering errors. We feel that the 100 read filter is conservative enough to demonstrate the utility of this tool. Of course, this threshold depends on the dataset, and we suggest tinkering around with this threshold.

Samples and OTUS with fewer than 100 reads can be filtered as follows.

In [3]:
!qiime feature-table filter-features \
    --i-table 88soils.biom.qza \
    --o-filtered-table 88soils_filt100.biom.qza \
    --p-min-frequency 100
Saved FeatureTable[Frequency] to: 88soils_filt100.biom.qza

Now we will build a tree based on the pH values, since in previous work it was observed that pH was the major driving factor behind the community dissimilarity.

In [4]:
!qiime gneiss gradient-clustering \
    --i-table 88soils_filt100.biom.qza \
    --m-gradient-file 88soils_metadata.txt \
    --m-gradient-category ph \
    --o-clustering ph_tree.nwk.qza \
    --p-no-weighted
Saved Hierarchy to: ph_tree.nwk.qza

The tree will give us information about how the OTUs partition. In this case, the tree groups together organisms that live in similar pH environments. To get a better idea about how these organisms, let's visualize these partitions on a heatmap. To this let's first categorize the samples by their measured pH.

Before running the regression, we have to account for zero abundances. Due the nature of zeros, we cannot be certain if the zeros arose from undersampling, or the complete absence of an OTU. To this extent, we'll add a pseudocount of 1 to approximate the uncertainity probability. We'll also want this for visualizing the heatmaps, since we'll be doing some log scaling.

In [5]:
!qiime composition add-pseudocount \
    --i-table 88soils_filt100.biom.qza \
    --p-pseudocount 1 \
    --o-composition-table 88soils_composition.biom.qza
Saved FeatureTable[Composition] to: 88soils_composition.biom.qza

Now we can generate a heatmap using these discretized pH values.

In [6]:
!qiime gneiss dendrogram-heatmap \
    --i-table 88soils_composition.biom.qza \
    --i-tree ph_tree.nwk.qza \
    --m-metadata-file 88soils_rounded_metadata.txt \
    --m-metadata-category "ph_rounded" \
    --o-visualization "ph_heatmap" \
    --p-ndim 10 --verbose
Saved Visualization to: ph_heatmap.qzv

Based on this, we can observe that there are clear cut clustering patterns of organisms with respect to pH.
We would like to run statistical tests that can determine how strong the correlation between the partitions and the pH is. Thats where balances can help us. To do this, we will calculate balances and perform an ordinary least squares regression on the transformed data.

We'll want to fit a quartic linear model to each of the balances individually with respect to pH. This model was chosen, because in the original paper, there was a horseshoe shape observed. So, it would make sense to use a parabolic function to fit each balance. Empirically, we found that a 4th degree polynomial gave the best results.

For now, we'll define 3 more variables encoding different powers of pH as follows.

Now we can run the linear regression using ordinary least squares regression. All of the covariates (i.e. ph, ph2, ph3, ph4) can be encoded in the regression model. Note that there is no dependent variable defined, since this regression is run individually against all of the balances specified by otu_table and ph_tree.

In [7]:
!qiime gneiss ilr-transform \
    --i-table 88soils_composition.biom.qza \
    --i-tree ph_tree.nwk.qza \
    --o-balances 88soils_balances.qza
Saved FeatureTable[Balance] to: 88soils_balances.qza
In [8]:
!qiime gneiss ols-regression \
    --p-formula "ph + ph2 + ph3 + ph4" \
    --i-table 88soils_balances.qza \
    --i-tree ph_tree.nwk.qza \
    --m-metadata-file 88soils_modified_metadata.txt \
    --o-visualization 88soils_regression_summary.qzv
Saved Visualization to: 88soils_regression_summary.qzv
In [9]:
!qiime gneiss balance-taxonomy \
    --i-balances 88soils_balances.qza \
    --i-tree ph_tree.nwk.qza \
    --i-taxonomy 88soils_taxonomy.qza \
    --p-taxa-level 2 \
    --p-balance-name 'y0' \
    --m-metadata-file '88soils_modified_metadata.txt' \
    --m-metadata-category 'ph' \
    --o-visualization y0_taxa_summary.qzv
Saved Visualization to: y0_taxa_summary.qzv

These summary results can be directly visualized in qiime2. Checkout view.qiime2.org and try to upload your qzv file theres.

We can see from a high level that some of the balances can describe the succession of low pH thriving microbes and high pH thriving microbes as the pH increases. We'll show in the Python tutorial how to investigate these results in more detail.