gneiss.cluster.gradient_linkage

gneiss.cluster.gradient_linkage(X, y, method='average')

Hierarchical clustering on a known gradient.
The hierarchy is built from the values of the samples located along a gradient. Given a feature \(x\), the mean gradient value at which \(x\) is observed is calculated by
\[f(g, x) = \sum\limits_{i=1}^N g_i \frac{x_i}{\sum\limits_{j=1}^N x_j}\]
where \(N\) is the number of samples, \(x_i\) is the proportion of feature \(x\) in sample \(i\), and \(g_i\) is the gradient value at sample \(i\).
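The weighted mean above can be sketched directly in NumPy (an illustrative sketch; within gneiss this logic is exposed as ``mean_niche_estimator``):

```python
import numpy as np

def mean_gradient(g, x):
    """f(g, x): mean gradient value weighted by a feature's proportions.

    Computes sum_i g_i * x_i / sum_j x_j, so the result is the
    gradient position where the feature is, on average, observed.
    """
    g = np.asarray(g, dtype=float)
    x = np.asarray(x, dtype=float)
    return (g * x).sum() / x.sum()

# A feature seen only in the first two samples of a 5-sample gradient
# sits, on average, between gradient values 1 and 2:
print(mean_gradient([1, 2, 3, 4, 5], [1, 1, 0, 0, 0]))  # 1.5
```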
The distance between two features \(x\) and \(y\) can be defined as
\[d(x, y) = (f(g, x) - f(g, y))^2\]
If \(d(x, y)\) is small, then \(x\) and \(y\) are expected to occupy similar positions along the gradient. Hierarchical clustering is then performed using \(d(x, y)\) as the distance metric.
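The two steps above, computing pairwise \(d(x, y)\) and clustering on it, can be sketched with SciPy (a sketch of the squared-difference metric from the docstring, not the library's internal implementation):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def gradient_distances(X, g):
    """Pairwise d(x, y) = (f(g, x) - f(g, y))**2 between features.

    X: (n_samples, n_features) table of counts or proportions.
    g: one gradient value per sample.
    """
    X = np.asarray(X, dtype=float)
    g = np.asarray(g, dtype=float)
    # f(g, x) for every feature (column) at once.
    means = (g[:, None] * X).sum(axis=0) / X.sum(axis=0)
    return (means[:, None] - means[None, :]) ** 2

X = np.array([[1, 0, 0, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1]])
g = np.array([1, 2, 3, 4, 5])
D = gradient_distances(X, g)                       # 4x4 distance matrix
Z = linkage(squareform(D, checks=False), method='average')
```

``Z`` is a standard SciPy linkage matrix; gneiss additionally converts it into an ``skbio.TreeNode``.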
This can be useful for constructing principal balances.
Parameters: - X (pd.DataFrame) – Contingency table where the samples are rows and the features are columns.
- y (pd.Series) – Continuous vector representing some ordering of the samples in X.
- method (str) – Clustering method. (default='average')
Returns: Tree for constructing principal balances.
Return type: skbio.TreeNode
See also
mean_niche_estimator()
Examples
>>> import pandas as pd
>>> from gneiss.cluster import gradient_linkage
>>> table = pd.DataFrame([[1, 1, 0, 0, 0],
...                       [0, 1, 1, 0, 0],
...                       [0, 0, 1, 1, 0],
...                       [0, 0, 0, 1, 1]],
...                      columns=['s1', 's2', 's3', 's4', 's5'],
...                      index=['o1', 'o2', 'o3', 'o4']).T
>>> gradient = pd.Series([1, 2, 3, 4, 5],
...                      index=['s1', 's2', 's3', 's4', 's5'])
>>> tree = gradient_linkage(table, gradient)
>>> print(tree.ascii_art())
                    /-o1
          /y1------|
         |          \-o2
-y0------|
         |          /-o3
          \y2------|
                    \-o4
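The pairing in the tree follows each feature's mean gradient position, which can be checked by recomputing \(f(g, x)\) for the example table with pandas (a sketch recomputing the formula by hand, not a gneiss call):

```python
import pandas as pd

table = pd.DataFrame([[1, 1, 0, 0, 0],
                      [0, 1, 1, 0, 0],
                      [0, 0, 1, 1, 0],
                      [0, 0, 0, 1, 1]],
                     columns=['s1', 's2', 's3', 's4', 's5'],
                     index=['o1', 'o2', 'o3', 'o4']).T
gradient = pd.Series([1, 2, 3, 4, 5],
                     index=['s1', 's2', 's3', 's4', 's5'])

# f(g, x) per feature: gradient-weighted mean of its proportions.
niche = table.mul(gradient, axis=0).sum() / table.sum()
print(niche)
# o1 -> 1.5, o2 -> 2.5, o3 -> 3.5, o4 -> 4.5: o1/o2 and o3/o4 are
# nearest neighbours along the gradient, matching the tree above.
```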