Chronic fatigue syndrome (CFS) is characterized by extreme fatigue that is not well understood. However, there has been reports of gastrointestinal issues in some patients with CFS. In Hanson 16S measures in addition to a series inflammatory markers. What they found is that based on 16S measurements alone they could accurately classify which people had CFS. However, the analyses used was very prone to mis-interpretation. There were multiple instances of Mann-whitney U-tests that were conducted to determine which microbes were differentially abundant, which has been shown to yield false discovery rates up to 100% (see here).

We will rerun these analyses with balance trees to validate these results. Like in the other tutorials we'll want to load up the biom table as a qiime2 compatible object.

In [1]:

```
!qiime tools import --input-path final.withtax.biom \
--output-path table.biom.qza \
--type FeatureTable[Frequency]
!qiime tools import --input-path taxa.txt \
--output-path taxa.qza \
--type FeatureData[Taxonomy]
```

To avoid taking logs of zeros we'll first filter out some of the low abundance OTUs. Here we will filter out OTUs that appear less than 100 times in all of the samples.

In [2]:

```
!qiime feature-table filter-features \
--i-table table.biom.qza \
--o-filtered-table table.filt.biom.qza \
--p-min-frequency 300
```

Then we'll remove all of the zeros by adding a pseudocount to all of the values.

In [3]:

```
!qiime composition add-pseudocount \
--i-table table.filt.biom.qza \
--p-pseudocount 1 \
--o-composition-table composition.biom.qza
```

In the next step, we will define a means to obtain principal balances. Principal balances is a way to represent maximally distinct partitions of features. Here we will define the partitions of microbes using Ward hierarchical clustering. This will cluster together microbes that commonly co-occur with each other using the following correlation metric.

$$d(x, y) = V [ \ln \frac{x}{y} ]$$Where $x$ and $y$ represent the proportions of two microbes across all of the samples. If two microbes are highly correlated, then this quantity will shrink close to zero. Ward hierarchical cluster will then use this distance metric to iteratively cluster together groups of microbes that are correlated with each other. In the end, the tree that we obtain will highlight the high level structure and identify the blocks within in the data.

In [4]:

```
!qiime gneiss correlation-clustering \
--i-table composition.biom.qza \
--o-clustering correlated_hierarchy.nwk.qza
```

In [5]:

```
!qiime gneiss ilr-transform \
--i-table composition.biom.qza \
--i-tree correlated_hierarchy.nwk.qza \
--o-balances balances.qza
```

Once we have obtained a means to partition the features, we can now run linear regression. In this module, the abundances will be converted to principal balances using the partition scheme that we defined earlier. The linear regression that we will be running is called a multivariate response linear regression. This will attempt to predict the microbial abundances based on environmental variables. Running these models has multiple advantages over standard univariate regression, as it avoids many of the issues associated with overfitting, and can gain perspective about community wide perburations based on environmental parameters.

Since the microbial abundances can be mapped directly represented to balances, we can perform this multivariate response directly on the balances. The model that we will be building is represented as follows

$$ \vec{y} = \vec{\beta_0} + \vec{\beta_{Subject}}\vec{X_{subject}} + \vec{\beta_{sex}}\vec{X_{sex}} + \vec{\beta_{age}}\vec{X_{Age}} + \vec{\beta_{sCD14ugml}}\vec{X_{sCD14ugml}} + \vec{\beta_{LBPugml}}\vec{X_{LBPugml}}$$Where $\vec{y}$ represents the matrix of balances to be predicted, $\vec{\beta_i}$ represents a vector of coefficients for covariate $i$ and $\vec{X_i}$ represents the measures for covariate $i$.

Remember that ANOVA is a special case of linear regression - every problem that can be solved by ANOVA can be reformulated as a linear regression. See this post on cross-validated for most details. So we can still answer the same sort of differential abundance questions using this technique, but we can start asking more precise questions, controlling for different potential confounding variables or even interaction effects.

In [6]:

```
!qiime gneiss ols-regression \
--p-formula "Subject+Sex+Age+BMI+sCD14ugml+LBPugml+LPSpgml" \
--i-table balances.qza \
--i-tree correlated_hierarchy.nwk.qza \
--m-metadata-file map.txt \
--o-visualization regression_summary.qzv
```

Now we have a summary of the regression model. Specifically we want to see which covariates impact the model the most, which balances are meaningful, and how much potential overfitting is going on. After we generate the regression summaries, we'll want to visualize them in view.qiime2.org

There are a few things to note in the regression summary. There is an $R^2$ in the summary, which gives information about the about of variance in the community is explained by the regression model. From what we can see, the regression can explain about 10% of the community. This is typical for what we see in human gut microbes, since there is a very high amount of confounding variation due to genetics, diet, environment, ...

Next, we have a heatmap visualizing all of the coefficient pvalues for all of the balances. The heatmap is colored by the negative log of the pvalue, highlighting potentially significant pvalues. A hovertool is enabled to allow for specific coefficient values and their corresponding pvalues to be obtained.

Below are cross validation plots, which given an idea about the within model error, and the prediction error. A leave-one-out cross validation scheme was used, where each sample is left out of the training step, and is predicted. In this case there were 85 cross validations performed, since there were 85 samples. From here, we can see that while the variance in the model error is much smaller than the prediction error, the mean model error is roughly the same as the mean prediction error, which is reassuring that there isn't much overfitting happening.

Next are the prediction and residual plots. Here, only the top two balances are plotted, and the prediction residuals from the model are projected onto these two balances. From these plots we can see that the predicted points lie within the same region as the original communities. However, we can see that the residuals have roughly the same variance as the predictions. This is a little unsettling - but note that we can only explain 10% of the community variance, so these sorts of calculations aren't completely unexpected.

The branch lengths in the visualized tree are also scaled by the explained sum of squares in the models. The longest branch lengths correspond to the most informative balances. This can allow us to get a high level overview of the most important balances in the model. From this plot and the above heatmap, we can see that balance $y0$ is important. These balances not only have very small pvalues for differentiating subjects, but they also have the largest branch lengths in the tree diagram. This suggests that these two partitions of microbes could differentiate the CFS patients from the controls.

Let's extract the balances and tree to see how large this effect size truly is.

In [7]:

```
!qiime gneiss balance-taxonomy \
--i-balances balances.qza \
--i-tree correlated_hierarchy.nwk.qza \
--i-taxonomy taxa.qza \
--p-taxa-level 2 \
--p-balance-name 'y0' \
--m-metadata-file map.txt \
--m-metadata-category Subject \
--o-visualization y0_taxa_summary.qzv --verbose
```

In the Python tutorial, we'll discuss how to dive into these results in a more fine tune manner. Specifically, we'll also want extract the regression results, taxonomies and metadata to build the heatmaps, boxplots and interactive trees.