gneiss.cluster.gradient_linkage

gneiss.cluster.gradient_linkage(X, y, method='average')

Hierarchical clustering on a known gradient.

The hierarchy is built from the values of the samples located along a gradient. Given a feature \(x\), the mean gradient value at which \(x\) is observed is calculated as

\[f(g , x) = \sum\limits_{i=1}^N g_i \frac{x_i}{\sum\limits_{j=1}^N x_j}\]

where \(N\) is the number of samples, \(x_i\) is the proportion of feature \(x\) in sample \(i\), and \(g_i\) is the gradient value at sample \(i\).
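As a rough sketch (not the library's implementation), the weighted mean \(f(g, x)\) can be computed directly with numpy; the arrays here are illustrative:

```python
import numpy as np

# Illustrative data: gradient values g_i for N = 5 samples, and the
# counts of a single feature x across those samples.
g = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # gradient value g_i per sample
x = np.array([1.0, 1.0, 0.0, 0.0, 0.0])   # feature x per sample

# f(g, x): gradient values weighted by the feature's proportions.
f = np.sum(g * x / x.sum())
print(f)  # 1.5
```

Here the feature occurs only in the first two samples, so its mean gradient position is midway between their gradient values.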

The distance between two features \(x\) and \(y\) can be defined as

\[d(x, y) = (f(g, x) - f(g, y))^2\]

If \(d(x, y)\) is small, then \(x\) and \(y\) are expected to occupy similar positions along the gradient. Hierarchical clustering is then performed using \(d(x, y)\) as the distance metric.
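The whole procedure can be sketched with scipy's hierarchical clustering (the library's actual implementation may differ in detail); the table below mirrors the example at the end of this page:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Illustrative table: 5 samples (rows) x 4 features (columns).
X = np.array([[1, 1, 0, 0, 0],
              [0, 1, 1, 0, 0],
              [0, 0, 1, 1, 0],
              [0, 0, 0, 1, 1]], dtype=float).T
g = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # gradient value per sample

# Mean gradient value f(g, x) for each feature column.
f = (g[:, None] * X / X.sum(axis=0)).sum(axis=0)

# Pairwise distances d(x, y) = (f_x - f_y)^2, condensed for linkage.
d = squareform((f[:, None] - f[None, :]) ** 2)
Z = linkage(d, method='average')
```

With average linkage, features whose mean gradient positions are closest (here the adjacent pairs) merge first, which is the structure visible in the ascii_art output below.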

This can be useful for constructing principal balances.

Parameters:
  • X (pd.DataFrame) – Contingency table where the samples are rows and the features are columns.
  • y (pd.Series) – Continuous vector representing some ordering of the samples in X.
  • method (str) – Clustering method. (default='average')
Returns:

Tree for constructing principal balances.

Return type:

skbio.TreeNode

See also

mean_niche_estimator()

Examples

>>> import pandas as pd
>>> from gneiss.cluster import gradient_linkage
>>> table = pd.DataFrame([[1, 1, 0, 0, 0],
...                       [0, 1, 1, 0, 0],
...                       [0, 0, 1, 1, 0],
...                       [0, 0, 0, 1, 1]],
...                      columns=['s1', 's2', 's3', 's4', 's5'],
...                      index=['o1', 'o2', 'o3', 'o4']).T
>>> gradient = pd.Series([1, 2, 3, 4, 5],
...                      index=['s1', 's2', 's3', 's4', 's5'])
>>> tree = gradient_linkage(table, gradient)
>>> print(tree.ascii_art())
                    /-o1
          /y1------|
         |          \-o2
-y0------|
         |          /-o3
          \y2------|
                    \-o4