calour.training.classify

calour.training.classify(exp: calour.experiment.Experiment, fields, estimator, cv=<sklearn.model_selection._split.RepeatedStratifiedKFold object>, predict='predict_proba', params=None)

Evaluate classification during cross validation.

Note

This function is also available as a class method Experiment.classify()

Parameters:
  • exp (Experiment) – Input experiment object.
  • fields (str or list of str) – Column name(s) in the sample metadata that contain the classes to predict. If a list of str is given, the function performs multi-task (aka multioutput-multiclass) classification, and the estimator must be a multi-task classifier. See http://scikit-learn.org/stable/modules/multiclass.html for more information.
  • estimator (estimator object implementing fit and predict) – scikit-learn estimator, e.g. sklearn.ensemble.RandomForestClassifier
  • cv (int, cross-validation generator or an iterable) – similar to the cv parameter in sklearn.model_selection.GridSearchCV
  • predict ({'predict', 'predict_proba'}) – The estimator method used to predict on the validation sets. Some estimators provide both: predict returns the predicted class, and predict_proba returns the probability of each class for a sample. For example, see sklearn.ensemble.RandomForestClassifier
  • params (dict of string to sequence, or sequence of such) – Parameter settings to evaluate, for example the output of sklearn.model_selection.ParameterGrid or sklearn.model_selection.ParameterSampler. By default, the estimator's scikit-learn default parameters are used.
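
As a rough illustration of what a "dict of string to sequence" describes, the sketch below expands such a dict into concrete parameter sets using only the standard library. sklearn.model_selection.ParameterGrid performs the same kind of expansion. The helper expand_grid and the grid values are hypothetical and are not part of calour; how classify consumes params internally is not shown here.

```python
from itertools import product


def expand_grid(grid):
    """Expand {'name': [value, ...]} into a list of concrete parameter dicts.

    Hypothetical helper for illustration only; not a calour function.
    """
    names = sorted(grid)  # fixed key order so the expansion is reproducible
    combos = product(*(grid[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Assumed example grid: both names are ordinary keyword arguments of
# sklearn.ensemble.RandomForestClassifier.
grid = {"n_estimators": [10, 100], "max_depth": [3, None]}
param_sets = expand_grid(grid)

for param_set in param_sets:
    print(param_set)
```

Two values for each of two parameters give four parameter sets. Since the function yields one result per parameter set, a grid like this would presumably produce four data frames.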
Yields:

pandas.DataFrame – The result of prediction per sample for a given parameter set. It contains the following columns:

  • Y_TRUE: the true class for the samples
  • SAMPLE: sample IDs
  • CV: which split of the cross validation
  • Y_PRED: the predicted class for the samples (if “predict”)
  • multiple columns, each containing the predicted probabilities for one class (if “predict_proba”)
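
To make the column layout above concrete, here is a minimal sketch that builds a table of the same shape with scikit-learn and pandas directly. It is not calour's implementation: the data are synthetic, the sample IDs are made up, and a single default parameter set is assumed, so it only illustrates what a per-sample result with “predict_proba” could look like.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold

# Synthetic stand-in for an experiment: 40 samples, 5 features, 2 classes.
X, y = make_classification(n_samples=40, n_features=5, random_state=0)
sample_ids = np.array([f"S{i}" for i in range(len(y))])  # hypothetical IDs

cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=2, random_state=0)
clf = RandomForestClassifier(n_estimators=20, random_state=0)

frames = []
for split, (train, test) in enumerate(cv.split(X, y)):
    clf.fit(X[train], y[train])
    # One probability column per class, named after the class label.
    frame = pd.DataFrame(clf.predict_proba(X[test]), columns=clf.classes_)
    frame["Y_TRUE"] = y[test]
    frame["SAMPLE"] = sample_ids[test]
    frame["CV"] = split  # index of the cross-validation split
    frames.append(frame)

result = pd.concat(frames, ignore_index=True)

# The most probable class per row, and the overall accuracy derived from it.
predicted = result[[0, 1]].idxmax(axis=1)
accuracy = (predicted == result["Y_TRUE"]).mean()
print(result.head())
print(f"accuracy: {accuracy:.2f}")
```

With repeated cross validation each sample is held out once per repeat, so every sample ID appears as many times as there are repeats, and the CV column is what distinguishes those rows.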