This is a Jupyter notebook example of how to sort and filter an experiment and handle its sample metadata.
In [1]:
import calour as ca
ca.set_log_level(11)
%matplotlib notebook
We use two datasets.
The first is the chronic fatigue syndrome data from:
Giloteaux, L., Goodrich, J.K., Walters, W.A., Levine, S.M., Ley, R.E. and Hanson, M.R., 2016.
Reduced diversity and altered composition of the gut microbiome in individuals with myalgic encephalomyelitis/chronic fatigue syndrome.
Microbiome, 4(1), p.30.
In [2]:
cfs = ca.read_amplicon('data/chronic-fatigue-syndrome.biom',
                       'data/chronic-fatigue-syndrome.sample.txt',
                       normalize=10000, min_reads=1000)
2018-03-04 12:38:34 INFO loaded 87 samples, 2129 features
2018-03-04 12:38:34 WARNING These have metadata but do not have data - dropped: {'ERR1331814'}
2018-03-04 12:38:34 INFO After filtering, 87 remaining
In [3]:
print(cfs)
AmpliconExperiment ("chronic-fatigue-syndrome.biom") with 87 samples, 2129 features
The second is the moving pictures dataset, from:
Caporaso, J.G., Lauber, C.L., Costello, E.K., Berg-Lyons, D., Gonzalez, A., Stombaugh, J., Knights, D., Gajer, P., Ravel, J., Fierer, N. and Gordon, J.I., 2011.
Moving pictures of the human microbiome.
Genome biology, 12(5), p.R50.
In [4]:
movpic = ca.read_amplicon('data/moving_pic.biom',
                          'data/moving_pic.sample.txt',
                          normalize=10000, min_reads=1000)
2018-03-04 12:38:36 INFO loaded 1968 samples, 7056 features
2018-03-04 12:38:37 INFO After filtering, 1967 remaining
In [5]:
print(movpic)
AmpliconExperiment ("moving_pic.biom") with 1967 samples, 7056 features
sort_samples
Sort the samples of the experiment based on the values in the given field.
Is the original data sorted by the Subject field?
In [6]:
print(cfs.sample_metadata['Subject'].is_monotonic_increasing)
False
In [7]:
cfs=cfs.sort_samples('Subject')
And is the new data sorted?
In [8]:
print(cfs.sample_metadata['Subject'].is_monotonic_increasing)
True
sort_samples keeps the previous order of samples whose values in the new field are tied (a stable sort).
For the moving pictures dataset, we want the data sorted by individual and, within each individual, by timepoint. We therefore sort by timepoint first and by individual second.
In [9]:
movpic=movpic.sort_samples('DAYS_SINCE_EXPERIMENT_START')
movpic=movpic.sort_samples('HOST_SUBJECT_ID')
In [10]:
print(movpic.sample_metadata['DAYS_SINCE_EXPERIMENT_START'].is_monotonic_increasing)
False
In [11]:
print(movpic.sample_metadata['HOST_SUBJECT_ID'].is_monotonic_increasing)
True
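The two-step sort above relies on ties keeping their previous order. A toy sketch of the same idea in plain Python, whose `sorted` is also stable; the records and field names are made up for illustration and calour is not used:

```python
# Hypothetical (subject, day) records; not taken from the moving pictures data.
samples = [
    {"subject": "M3", "day": 2},
    {"subject": "F4", "day": 3},
    {"subject": "M3", "day": 1},
    {"subject": "F4", "day": 1},
    {"subject": "F4", "day": 2},
]

# Sort by the secondary key (day) first, then by the primary key (subject).
# Because the sort is stable, days stay ordered within each subject.
by_day = sorted(samples, key=lambda s: s["day"])
by_subject = sorted(by_day, key=lambda s: s["subject"])

order = [(s["subject"], s["day"]) for s in by_subject]
print(order)
```

As in the notebook output, the day column is ordered within each subject but is not monotonic over the whole table.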
filter_samples
Keep only samples matching the values we supply for the selected metadata field.
Let's keep only samples from participant F4.
In [12]:
tt=movpic.filter_samples('HOST_SUBJECT_ID','F4')
print('* original:\n%s\n\n* filtered:\n%s' % (movpic, tt))
* original:
AmpliconExperiment ("moving_pic.biom") with 1967 samples, 7056 features
* filtered:
AmpliconExperiment ("moving_pic.biom") with 534 samples, 7056 features
Now let's keep only the skin and fecal samples.
In [13]:
print(movpic.sample_metadata['BODY_HABITAT'].unique())
['UBERON:skin' 'UBERON:feces' 'UBERON:oral cavity']
In [14]:
yy=tt.filter_samples('BODY_HABITAT', ['UBERON:skin', 'UBERON:feces'])
print(yy)
AmpliconExperiment ("moving_pic.biom") with 399 samples, 7056 features
The negate=True parameter inverts the selection.
Let's keep just the non-skin and non-feces samples.
In [15]:
yy=tt.filter_samples('BODY_HABITAT', ['UBERON:skin', 'UBERON:feces'], negate=True)
print(yy)
AmpliconExperiment ("moving_pic.biom") with 135 samples, 7056 features
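The selection logic can be sketched with plain Python. This is a toy stand-in, not calour's implementation; the function name `filter_records` and the records are hypothetical:

```python
def filter_records(records, field, values, negate=False):
    """Toy stand-in for value-based sample filtering (not calour's code)."""
    if isinstance(values, str):
        values = [values]  # accept a single value or a list of values
    wanted = set(values)
    # Keep a record when its match status differs from `negate`.
    return [r for r in records if (r[field] in wanted) != negate]


# Hypothetical sample records.
records = [
    {"id": "s1", "habitat": "skin"},
    {"id": "s2", "habitat": "feces"},
    {"id": "s3", "habitat": "oral cavity"},
    {"id": "s4", "habitat": "skin"},
]

kept = filter_records(records, "habitat", ["skin", "feces"])
dropped = filter_records(records, "habitat", ["skin", "feces"], negate=True)
print(len(kept), len(dropped))
```

The two results partition the input, which is why the counts in the cells above (399 and 135) add up to the 534 samples from F4.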
filter_abundance
Remove all features (bacteria) whose total number of reads (summed over all samples, after normalization) is below the given threshold (25 in the cell below).
This is useful for getting rid of uninteresting features. Note that, unlike
filtering based on the fraction of samples in which a feature is present
(filter_prevalence), this method (filter_abundance) also keeps features
that are present in only a small fraction of the samples but at high
frequency.
In [16]:
tt=cfs.filter_abundance(25)
print('* original:\n%s\n\n* filtered:\n%s' % (cfs, tt))
2018-03-04 12:38:44 INFO After filtering, 766 remaining
* original:
AmpliconExperiment ("chronic-fatigue-syndrome.biom") with 87 samples, 2129 features
* filtered:
AmpliconExperiment ("chronic-fatigue-syndrome.biom") with 87 samples, 766 features
By default, the function removes the low-abundance features. This can be
reversed (i.e. keep only the low-abundance features) by using the negate=True
parameter.
In [17]:
tt=cfs.filter_abundance(25, negate=True)
print('* original:\n%s\n\n* filtered:\n%s' % (cfs,tt))
2018-03-04 12:38:45 INFO After filtering, 1363 remaining
* original:
AmpliconExperiment ("chronic-fatigue-syndrome.biom") with 87 samples, 2129 features
* filtered:
AmpliconExperiment ("chronic-fatigue-syndrome.biom") with 87 samples, 1363 features
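A minimal sketch of the thresholding on a toy table, in plain Python rather than calour. The counts are hypothetical, and how a feature exactly at the cutoff is treated is an assumption here (the toy data avoids that case):

```python
# Toy table: rows are samples, columns are features (hypothetical counts).
table = [
    [30, 0, 2],
    [0, 1, 3],
    [0, 0, 4],
]
cutoff = 25

# Total reads per feature, summed over all samples.
totals = [sum(col) for col in zip(*table)]

# Default behaviour: keep features at or above the cutoff (assumed boundary rule).
keep = [i for i, t in enumerate(totals) if t >= cutoff]
# negate=True behaviour: keep the complementary set instead.
keep_negated = [i for i, t in enumerate(totals) if t < cutoff]
print(totals, keep, keep_negated)
```

Every feature lands in exactly one of the two sets, matching the outputs above: 766 + 1363 = 2129 features.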
filter_prevalence
Remove bacteria based on the fraction of samples in which each bacterium is present.
In [18]:
# remove bacteria present in less than half of the samples
tt=cfs.filter_prevalence(0.5)
print('* original:\n%s\n\n* filtered:\n%s' % (cfs, tt))
2018-03-04 12:38:46 INFO After filtering, 128 remaining
* original:
AmpliconExperiment ("chronic-fatigue-syndrome.biom") with 87 samples, 2129 features
* filtered:
AmpliconExperiment ("chronic-fatigue-syndrome.biom") with 87 samples, 128 features
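To illustrate how prevalence-based and abundance-based filtering differ, here is a toy sketch in plain Python. The counts and both thresholds are hypothetical, and "present" is taken to mean any non-zero count, which is an assumption rather than calour's documented rule:

```python
# Toy table: rows are samples, columns are features (hypothetical counts).
table = [
    [900, 1, 0],
    [0, 2, 0],
    [0, 1, 5],
    [0, 3, 0],
]
n_samples = len(table)
columns = list(zip(*table))

# Abundance: total reads per feature over all samples.
totals = [sum(col) for col in columns]
# Prevalence: fraction of samples with a non-zero count (assumed definition).
prevalence = [sum(1 for v in col if v > 0) / n_samples for col in columns]

keep_by_abundance = [i for i, t in enumerate(totals) if t >= 10]
keep_by_prevalence = [i for i, p in enumerate(prevalence) if p >= 0.5]
print(totals, prevalence, keep_by_abundance, keep_by_prevalence)
```

Feature 0 is abundant but appears in a single sample, so only the abundance filter keeps it. Feature 1 is rare in reads but present everywhere, so only the prevalence filter keeps it.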
filter_mean
Remove bacteria whose mean (over all samples) is lower than the desired threshold.
In [19]:
# keep only high frequency bacteria (mean over all samples > 1%)
tt=cfs.filter_mean(0.01)
print('* original:\n%s\n\n* filtered:\n%s' % (cfs, tt))
2018-03-04 12:38:47 INFO After filtering, 19 remaining
* original:
AmpliconExperiment ("chronic-fatigue-syndrome.biom") with 87 samples, 2129 features
* filtered:
AmpliconExperiment ("chronic-fatigue-syndrome.biom") with 87 samples, 19 features
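A toy sketch of mean-based filtering in plain Python. It assumes the threshold is read as a fraction of the per-sample total (10000 reads after normalization), as the "> 1%" comment in the cell above suggests; the counts are hypothetical and this is not calour's code:

```python
reads_per_sample = 10000  # matches normalize=10000 used when loading
# Toy table: rows are samples, columns are features (hypothetical counts).
table = [
    [9000, 950, 50],
    [9500, 450, 50],
]
cutoff = 0.01  # 1% mean relative abundance

n_samples = len(table)
# Mean count per feature, expressed as a fraction of the per-sample total.
mean_fraction = [sum(col) / n_samples / reads_per_sample for col in zip(*table)]

keep = [i for i, m in enumerate(mean_fraction) if m >= cutoff]
print(mean_fraction, keep)
```

The third feature averages 0.5% of the reads and is dropped; the other two pass the 1% threshold.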