Microbiome data manipulation tutorial

This Jupyter notebook shows how to sort and filter the samples and features of an experiment, and how to work with the sample metadata.

Setup

In [1]:
import calour as ca
ca.set_log_level(11)
%matplotlib notebook

Load the data

We use two datasets:

The chronic fatigue syndrome data from:

Giloteaux, L., Goodrich, J.K., Walters, W.A., Levine, S.M., Ley, R.E. and Hanson, M.R., 2016.

Reduced diversity and altered composition of the gut microbiome in individuals with myalgic encephalomyelitis/chronic fatigue syndrome.

Microbiome, 4(1), p.30.

In [2]:
cfs=ca.read_amplicon('data/chronic-fatigue-syndrome.biom',
                     'data/chronic-fatigue-syndrome.sample.txt',
                     normalize=10000,min_reads=1000)
2018-03-04 12:38:34 INFO loaded 87 samples, 2129 features
2018-03-04 12:38:34 WARNING These have metadata but do not have data - dropped: {'ERR1331814'}
2018-03-04 12:38:34 INFO After filtering, 87 remaining
In [3]:
print(cfs)
AmpliconExperiment ("chronic-fatigue-syndrome.biom") with 87 samples, 2129 features

The Moving Pictures dataset, from:

Caporaso, J.G., Lauber, C.L., Costello, E.K., Berg-Lyons, D., Gonzalez, A., Stombaugh, J., Knights, D., Gajer, P., Ravel, J., Fierer, N. and Gordon, J.I., 2011.

Moving pictures of the human microbiome.

Genome biology, 12(5), p.R50.

In [4]:
movpic=ca.read_amplicon('data/moving_pic.biom',
                        'data/moving_pic.sample.txt',
                        normalize=10000,min_reads=1000)
2018-03-04 12:38:36 INFO loaded 1968 samples, 7056 features
2018-03-04 12:38:37 INFO After filtering, 1967 remaining
In [5]:
print(movpic)
AmpliconExperiment ("moving_pic.biom") with 1967 samples, 7056 features

Sorting the samples based on a metadata field (sort_samples)

Sort the samples of the experiment based on the values in the given field.

Is the original data sorted by the Subject field?

In [6]:
print(cfs.sample_metadata['Subject'].is_monotonic_increasing)
False
In [7]:
cfs=cfs.sort_samples('Subject')

And is the new data sorted?

In [8]:
print(cfs.sample_metadata['Subject'].is_monotonic_increasing)
True

Consecutive sorting using different fields

Sorting keeps the existing order of samples whose values in the new field are tied. To sort by several fields, sort by the least significant field first and by the most significant field last.

For the Moving Pictures dataset, we want the data to be sorted by individual and, within each individual, by timepoint.

In [9]:
movpic=movpic.sort_samples('DAYS_SINCE_EXPERIMENT_START')
movpic=movpic.sort_samples('HOST_SUBJECT_ID')
In [10]:
print(movpic.sample_metadata['DAYS_SINCE_EXPERIMENT_START'].is_monotonic_increasing)
False
In [11]:
print(movpic.sample_metadata['HOST_SUBJECT_ID'].is_monotonic_increasing)
True
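The two checks above are what a stable sort produces: the subject field is globally ordered, while the day field is ordered only within each subject. The following is a minimal pure-Python sketch of the same idea; it does not use calour, and the sample IDs, subjects, and days are made-up values for illustration.

```python
# Toy sample metadata: (sample_id, subject, day). Values are invented for illustration.
samples = [
    ("s1", "M3", 2),
    ("s2", "F4", 1),
    ("s3", "M3", 0),
    ("s4", "F4", 0),
    ("s5", "F4", 2),
]


def sort_by(records, index):
    """Return the records sorted by one column. Python's sorted() is stable,
    so records with equal keys keep their previous relative order."""
    return sorted(records, key=lambda rec: rec[index])


# Sort by the secondary key first (day), then by the primary key (subject).
by_day = sort_by(samples, 2)
by_subject_then_day = sort_by(by_day, 1)

for rec in by_subject_then_day:
    print(rec)

# Days are ordered within each subject, but not across the whole list.
days = [rec[2] for rec in by_subject_then_day]
print(days == sorted(days))
```

Reversing the order of the two sorts would instead give samples ordered by day, with subjects ordered only among samples from the same day.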

Filter samples based on a metadata field (filter_samples)

Keep only samples matching the values we supply for the selected metadata field.

Let's keep only samples from participant F4.

In [12]:
tt=movpic.filter_samples('HOST_SUBJECT_ID','F4')
print('* original:\n%s\n\n* filtered:\n%s' % (movpic, tt))
* original:
AmpliconExperiment ("moving_pic.biom") with 1967 samples, 7056 features

* filtered:
AmpliconExperiment ("moving_pic.biom") with 534 samples, 7056 features

We can supply a list of values instead of a single value.

Now let's keep only the skin and fecal samples.

In [13]:
print(movpic.sample_metadata['BODY_HABITAT'].unique())
['UBERON:skin' 'UBERON:feces' 'UBERON:oral cavity']
In [14]:
yy=tt.filter_samples('BODY_HABITAT', ['UBERON:skin', 'UBERON:feces'])
print(yy)
AmpliconExperiment ("moving_pic.biom") with 399 samples, 7056 features

We can also reverse the filtering (removing the samples with the supplied values).

To do so, we use the negate=True parameter.

Let's keep just the non-skin and non-feces samples.

In [15]:
yy=tt.filter_samples('BODY_HABITAT', ['UBERON:skin', 'UBERON:feces'], negate=True)
print(yy)
AmpliconExperiment ("moving_pic.biom") with 135 samples, 7056 features
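The selection logic of the last few cells can be sketched in plain Python. This is an illustration of the idea only, not calour's implementation: `filter_by_field` is a hypothetical helper, and the sample records are invented.

```python
# Toy sample metadata; the IDs and values are invented for illustration.
samples = [
    {"id": "s1", "subject": "F4", "habitat": "skin"},
    {"id": "s2", "subject": "F4", "habitat": "feces"},
    {"id": "s3", "subject": "F4", "habitat": "oral cavity"},
    {"id": "s4", "subject": "M3", "habitat": "skin"},
]


def filter_by_field(records, field, values, negate=False):
    """Keep records whose `field` is one of `values`.
    With negate=True, keep the records that do NOT match instead."""
    if isinstance(values, str):
        values = [values]  # accept a single value as well as a list
    wanted = set(values)
    return [rec for rec in records if (rec[field] in wanted) != negate]


f4 = filter_by_field(samples, "subject", "F4")
skin_or_feces = filter_by_field(f4, "habitat", ["skin", "feces"])
everything_else = filter_by_field(f4, "habitat", ["skin", "feces"], negate=True)

print([rec["id"] for rec in skin_or_feces])
print([rec["id"] for rec in everything_else])
```

A filter and its negation split the input into two disjoint parts. The same holds in the cells above: 399 + 135 = 534 samples from participant F4.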

Filter low abundance features (filter_abundance)

Remove all features (bacteria) with fewer than 25 reads in total (summed over all samples, after normalization).

This is useful for getting rid of uninteresting features. Note that, unlike filtering by the fraction of samples in which a feature is present (filter_prevalence), this method (filter_abundance) also keeps features that are present in only a small fraction of the samples but at high frequency.

In [16]:
tt=cfs.filter_abundance(25)
print('* original:\n%s\n\n* filtered:\n%s' % (cfs, tt))
2018-03-04 12:38:44 INFO After filtering, 766 remaining
* original:
AmpliconExperiment ("chronic-fatigue-syndrome.biom") with 87 samples, 2129 features

* filtered:
AmpliconExperiment ("chronic-fatigue-syndrome.biom") with 87 samples, 766 features

Keeping the low abundance bacteria instead

By default, the function removes the low abundance features. This can be reversed (i.e. keeping only the low abundance features) by using the negate=True parameter.

In [17]:
tt=cfs.filter_abundance(25, negate=True)
print('* original:\n%s\n\n* filtered:\n%s' % (cfs,tt))
2018-03-04 12:38:45 INFO After filtering, 1363 remaining
* original:
AmpliconExperiment ("chronic-fatigue-syndrome.biom") with 87 samples, 2129 features

* filtered:
AmpliconExperiment ("chronic-fatigue-syndrome.biom") with 87 samples, 1363 features
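As a rough pure-Python sketch of filtering by total abundance: `filter_by_total` is a hypothetical helper, the counts are invented, and whether a feature exactly at the threshold is kept is an assumption here rather than a statement about calour's behavior.

```python
# Toy feature table: feature name -> read counts per sample (invented values).
table = {
    "taxon_A": [0, 0, 0, 30],     # present in one sample only, but abundant there
    "taxon_B": [2, 3, 1, 2],      # present in every sample, at low counts
    "taxon_C": [10, 12, 9, 11],   # present in every sample, at higher counts
}


def filter_by_total(table, min_total, negate=False):
    """Keep features whose counts, summed over all samples, reach min_total.
    With negate=True, keep the features below the threshold instead.
    (Keeping a feature exactly at the threshold is an assumption of this sketch.)"""
    return {
        name: counts
        for name, counts in table.items()
        if (sum(counts) >= min_total) != negate
    }


high = filter_by_total(table, 25)
low = filter_by_total(table, 25, negate=True)

print(sorted(high))
print(sorted(low))
```

Here `taxon_A` survives although it appears in a single sample, because its total is high. As with the sample filter, the two results partition the features; in the cells above, 766 + 1363 = 2129.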

Filter non-common bacteria (filter_prevalence)

Remove bacteria based on the fraction of samples in which each bacterium is present.

In [18]:
# remove bacteria present in less than half of the samples
tt=cfs.filter_prevalence(0.5)
print('* original:\n%s\n\n* filtered:\n%s' % (cfs, tt))
2018-03-04 12:38:46 INFO After filtering, 128 remaining
* original:
AmpliconExperiment ("chronic-fatigue-syndrome.biom") with 87 samples, 2129 features

* filtered:
AmpliconExperiment ("chronic-fatigue-syndrome.biom") with 87 samples, 128 features
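For comparison with filtering by total abundance, here is a minimal pure-Python sketch of a prevalence filter. The helper names are hypothetical, the counts are invented, and treating any count above zero as "present" is an assumption of this sketch.

```python
# Toy feature table: feature name -> read counts per sample (invented values).
table = {
    "taxon_A": [0, 0, 0, 30],     # abundant, but present in 1 of 4 samples
    "taxon_B": [2, 3, 1, 2],      # low counts, but present in all samples
    "taxon_C": [10, 12, 9, 11],   # present in all samples
}


def prevalence(counts):
    """Fraction of samples in which the feature is present (count > 0)."""
    return sum(1 for c in counts if c > 0) / len(counts)


def filter_by_prevalence(table, min_fraction):
    """Keep features present in at least min_fraction of the samples."""
    return {
        name: counts
        for name, counts in table.items()
        if prevalence(counts) >= min_fraction
    }


common = filter_by_prevalence(table, 0.5)
print(sorted(common))
```

In this toy table the prevalence filter drops `taxon_A` (one abundant sample) and keeps `taxon_B` (few reads everywhere), which is the opposite of what a total-abundance threshold of 25 would do.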

Filter bacteria based on the mean frequency over all samples (filter_mean)

Remove bacteria whose mean frequency (over all samples) is lower than the desired threshold.

In [19]:
# keep only high frequency bacteria (mean over all samples > 1%)
tt=cfs.filter_mean(0.01)
print('* original:\n%s\n\n* filtered:\n%s' % (cfs, tt))
2018-03-04 12:38:47 INFO After filtering, 19 remaining
* original:
AmpliconExperiment ("chronic-fatigue-syndrome.biom") with 87 samples, 2129 features

* filtered:
AmpliconExperiment ("chronic-fatigue-syndrome.biom") with 87 samples, 19 features
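A minimal pure-Python sketch of a mean-frequency filter follows. It assumes the threshold is a fraction of the per-sample read total (10000 reads per sample here, matching the normalization used when loading the data); that interpretation, the hypothetical helper `filter_by_mean`, and the counts are assumptions for illustration, not calour's implementation.

```python
DEPTH = 10000  # reads per sample after normalization (assumed for this sketch)

# Toy feature table: feature name -> normalized counts per sample (invented values).
# Each sample (column) sums to DEPTH.
table = {
    "taxon_A": [500, 0, 100, 0],           # mean 150 reads, i.e. 1.5%
    "taxon_B": [50, 60, 40, 50],           # mean 50 reads, i.e. 0.5%
    "taxon_C": [9450, 9940, 9860, 9950],   # dominant feature
}


def filter_by_mean(table, min_fraction, depth=DEPTH):
    """Keep features whose mean count over all samples, expressed as a
    fraction of the per-sample depth, is at least min_fraction."""
    return {
        name: counts
        for name, counts in table.items()
        if (sum(counts) / len(counts)) / depth >= min_fraction
    }


frequent = filter_by_mean(table, 0.01)
print(sorted(frequent))
```

Unlike the prevalence filter, the mean filter keeps `taxon_A` even though it is absent from half of the samples, because its mean frequency is above 1%.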