gneiss.regression.ols

gneiss.regression.ols(formula, table, metadata)

Ordinary Least Squares applied to balances.

An ordinary least squares (OLS) regression is a method for estimating parameters in a linear regression model. OLS is a common statistical technique for fitting and testing the effects of covariates on a response. This implementation is focused on performing a multivariate response regression where the response is a matrix of balances (table) and the covariates (metadata) are made up of external variables.

Global statistical tests indicating goodness of fit and contributions from covariates can be accessed via the coefficient of determination (r2), leave-one-variable-out cross validation (lovo), leave-one-out cross validation (loo) and k-fold cross validation (kfold). In addition, residuals (residuals) can be accessed for diagnostic purposes.
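For intuition, the idea behind the leave-one-out statistic can be sketched in a few lines: refit the model without sample i, predict that sample, and accumulate the squared prediction error (the classical PRESS statistic). This is an illustration of the concept only, not gneiss's actual loo implementation.

```python
import numpy as np

# Sketch of leave-one-out cross validation for one balance (PRESS statistic).
# Not gneiss's implementation; shown only to illustrate the idea behind `loo`.
def loo_press(X, y):
    n = len(y)
    errs = []
    for i in range(n):
        mask = np.arange(n) != i                          # drop sample i
        beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        errs.append(float(y[i] - X[i] @ beta))            # predict held-out sample
    return sum(e * e for e in errs)

g1 = np.linspace(0, 15, 30)
X = np.column_stack([np.ones(30), g1])                    # intercept + covariate
press = loo_press(X, 2.0 * g1 + 1.0)                      # exactly linear, so error ~ 0
```

A small PRESS relative to the total sum of squares indicates that the model generalizes to held-out samples rather than merely interpolating them.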

T-statistics (tvalues) and p-values (pvalues) can be obtained to evaluate the statistical significance of a covariate for a given balance. Predictions from the resulting model can be made using predict, and these results can be interpreted either as balances or as proportions.
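A multivariate response regression of this kind reduces to fitting an ordinary OLS model to each balance column separately. The following numpy/pandas sketch shows that reduction, producing per-balance coefficients and t-statistics in closed form; it is an illustration of the underlying arithmetic, not gneiss's actual code.

```python
import numpy as np
import pandas as pd

# Illustration only: multivariate-response OLS over balances is equivalent
# to one ordinary least squares fit per balance column.
def ols_per_balance(X, Y):
    """Return per-balance coefficients and t-statistics (closed-form OLS)."""
    Xm = np.column_stack([np.ones(len(X)), np.asarray(X)])   # add intercept
    XtX_inv = np.linalg.inv(Xm.T @ Xm)
    beta = XtX_inv @ Xm.T @ np.asarray(Y)                    # (params, balances)
    resid = np.asarray(Y) - Xm @ beta
    dof = Xm.shape[0] - Xm.shape[1]
    sigma2 = (resid ** 2).sum(axis=0) / dof                  # per-balance error variance
    se = np.sqrt(np.outer(np.diag(XtX_inv), sigma2))         # standard errors
    idx = ['Intercept'] + list(X.columns)
    return (pd.DataFrame(beta, index=idx, columns=Y.columns),
            pd.DataFrame(beta / se, index=idx, columns=Y.columns))

rng = np.random.default_rng(0)
g1 = np.linspace(0, 15, 100)
Y = pd.DataFrame({'y1': g1 + 5, 'y2': -g1 - 2}) + rng.normal(1, 0.1, (100, 2))
coef, tvals = ols_per_balance(pd.DataFrame({'g1': g1}), Y)
```

Large absolute t-statistics correspond to small p-values, indicating that the covariate contributes meaningfully to that balance.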

Parameters:
  • formula (str) – Formula representing the statistical equation to be evaluated. These strings are similar to how equations are handled in R and statsmodels. Note that the dependent variable in this string should not be specified, since this method will be run on each of the individual balances. See patsy for more details.
  • table (pd.DataFrame) – Table of balances where samples correspond to rows and balances correspond to columns.
  • metadata (pd.DataFrame) – Metadata table that contains information about the samples contained in the table object. Samples correspond to rows and covariates correspond to columns.
Returns:

Container object that holds information about the overall fit. This includes information about coefficients, pvalues, residuals and coefficient of determination from the resulting regression.

Return type:

OLSModel

Example

>>> import numpy as np
>>> import pandas as pd
>>> from gneiss.regression import ols

Here, we will define a table of balances as follows

>>> np.random.seed(0)
>>> n = 100
>>> g1 = np.linspace(0, 15, n)
>>> y1 = g1 + 5
>>> y2 = -g1 - 2
>>> Y = pd.DataFrame({'y1': y1, 'y2': y2})

Once we have the balances defined, we will add some errors

>>> e = np.random.normal(loc=1, scale=0.1, size=(n, 2))
>>> Y = Y + e

Now we will define the environment variables that we want to regress against the balances.

>>> X = pd.DataFrame({'g1': g1})

Once these variables are defined, the regression can be performed. Here the formula 'g1' specifies that the covariate g1 is regressed against each of the balances y1 and y2 in a single model.

>>> res = ols('g1', Y, X)
>>> res.fit()

From the summary results of the fitted model, we can view the p-values indicating how strongly the covariate is associated with each individual balance.

>>> res.pvalues
                      y1             y2
Intercept  8.826379e-148   7.842085e-71
g1         1.923597e-163  1.277152e-163

We can also view the balance coefficients estimated in the regression model. These coefficients can also be viewed as proportions by passing project=True to res.coefficients().

>>> res.coefficients()
                 y1        y2
Intercept  6.016459 -0.983476
g1         0.997793 -1.000299
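To build intuition for how balance coefficients relate to proportions: with only two parts, a single balance can be mapped back to proportions with the inverse ILR transform. The helper below (inverse_ilr_two_parts is a hypothetical name for illustration) assumes the standard orthonormal two-part basis; gneiss handles general trees.

```python
import numpy as np

# Hypothetical helper for intuition: with two parts, the inverse ILR maps a
# balance b = (1/sqrt(2)) * ln(x1/x2) back to proportions (x1, x2).
# Assumes the standard two-part orthonormal basis, not a general tree.
def inverse_ilr_two_parts(b):
    logratio = np.sqrt(2.0) * b        # recover ln(x1/x2)
    x1 = np.exp(logratio)              # unnormalized x1, taking x2 = 1
    return np.array([x1, 1.0]) / (x1 + 1.0)   # closure: proportions sum to 1
```

A balance of zero corresponds to equal proportions, and increasingly positive balances shift mass toward the first part.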

The overall model fit can be obtained as follows

>>> res.r2
0.99945903186495066
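As a sanity check, a coefficient of determination of this magnitude can be reproduced by pooling squared errors across both balance columns, using one common definition, R² = 1 − SSE/SST (gneiss's exact bookkeeping may differ in detail).

```python
import numpy as np

# Reproduce an overall R^2 over a matrix of balances by pooling squared
# errors across columns: R^2 = 1 - SSE/SST (one common definition).
rng = np.random.default_rng(0)
n = 100
g1 = np.linspace(0, 15, n)
Y = np.column_stack([g1 + 5, -g1 - 2]) + rng.normal(1, 0.1, (n, 2))

X = np.column_stack([np.ones(n), g1])                 # intercept + covariate
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)          # fit both balances at once
resid = Y - X @ beta
r2 = 1.0 - (resid ** 2).sum() / ((Y - Y.mean(axis=0)) ** 2).sum()
```

With noise of scale 0.1 against a signal spanning 15 units, nearly all of the variance is explained, so r2 lands very close to 1.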