This is a jupyter notebook example of how to load, process and plot data from a microbiome experiment using Calour.
In [1]:
import calour as ca
/Users/amnon/miniconda3/envs/calour/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
can use:
The Calour default is warning (21)
In [2]:
ca.set_log_level(11)
In [3]:
%matplotlib notebook
For an amplicon experiment we use ca.read_amplicon()
First parameter is the location+name of the biom table file (can be hdf5/json/txt biom table - see here for details)
Second (optional) parameter is the sample mapping file locaion+name. First column should be the sample id (identical to the sample ids in the biom table). Rest of the column are information fields about each sample.
normalize=XXX : tells calour to rescale each sample to XXX reads (by dividing each feature frequency by the total number of reads in the sample and multiplying by XXX). Alternatively, can use normalize=None to skip normalization (i.e. in the case the biom table is already rarified)
min_reads=XXX : throw away samples with less than min_reads total (before normalization). Useful to get rid of samples with small number of reads. Can use min_reads=None to keep all samples.
See here for all possible parameters for read_amplicon()
We will use the data from:
Giloteaux, L., Goodrich, J.K., Walters, W.A., Levine, S.M., Ley, R.E. and Hanson, M.R., 2016.
Reduced diversity and altered composition of the gut microbiome in individuals with myalgic encephalomyelitis/chronic fatigue syndrome.
Microbiome, 4(1), p.30.
In [4]:
dat=ca.read_amplicon('data/chronic-fatigue-syndrome.biom',
'data/chronic-fatigue-syndrome.sample.txt',
normalize=10000,min_reads=1000)
2018-03-04 12:40:24 INFO loaded 87 samples, 2129 features
2018-03-04 12:40:24 WARNING These have metadata but do not have data - dropped: {'ERR1331814'}
2018-03-04 12:40:24 INFO After filtering, 87 remaining
In [5]:
print(dat)
AmpliconExperiment ("chronic-fatigue-syndrome.biom") with 87 samples, 2129 features
We throw away all features with total reads (over all samples) < 10 (after each sample was normalized to 10k reads/sample). So a bacteria present (with 1 read) in 10 samples will be kept, as well as a bacteria present in only one sample, with 10 reads in this sample. Note alternatively we could filter based on mean reads/sample or fraction of samples where the feature is present. Each method filters away slightly different bacteria. See filtering notebook for details on the filtering functions.
In [6]:
dat=dat.filter_abundance(10)
2018-03-04 12:40:30 INFO After filtering, 1100 remaining
Features are clustered (hierarchical clustering) based on euaclidian distance between features (over all samples) following normalizing each feature to mean 0 std 1. For more details and examples, see sorting notebook or cluster_features documentation
In [7]:
datc=dat.cluster_features()
2018-03-04 12:40:30 INFO After filtering, 1100 remaining
Note that order within each group of similar value is maintained. We first sort by physical functioning, then sort by the disease state. So within each disease state, samples will still be sorted by physical functioning.
In [8]:
datc=datc.sort_samples('Physical_functioning')
datc=datc.sort_samples('Subject')
Columns (x-axis) are the samples, rows (y-axis) are the features. We will show on the x-axis the host-individual field of each sample.
we will use the jupyter notebook GUI so we will see the interactive plot in the notebook. Alternatively we could use the qt5 GUI to see the plot in a separate standalone window.
A few cool things we can do with the interactive plot:
See here for more details
In [9]:
datc.plot(sample_field='Subject', gui='jupyter')
Out[9]:
<calour.heatmap.plotgui_jupyter.PlotGUI_Jupyter at 0x1093a7630>
Now let’s add the values of the “Sex” field into the xbar on top First we’ll also sort by sex, so values will be continuous (note we then sort by the disease state to get the two groups separated).
In [10]:
datc=datc.sort_samples('Sex')
datc=datc.sort_samples('Subject')
In [11]:
datc.plot(sample_field='Subject', gui='jupyter',barx_fields=['Sex'])
Out[11]:
<calour.heatmap.plotgui_jupyter.PlotGUI_Jupyter at 0x1a19dacc50>
Let’s look for bacteria separating sick from healthy We ask it to find all bacteria significantly different between samples with ‘Control’ and ‘Patient’ in the ‘Subject’ field.
By default calour uses the mean of the ranks of each feature (over all samples), with dsFDR multiple hypothesis correction.
For more information, see notebook and function doc
In [12]:
dd=datc.diff_abundance(field='Subject',val1='Control',val2='Patient', random_seed=2018)
2018-03-04 12:40:36 INFO 87 samples with both values
2018-03-04 12:40:36 INFO After filtering, 1100 remaining
2018-03-04 12:40:36 INFO 39 samples with value 1 (['Control'])
2018-03-04 12:40:37 INFO method meandiff. number of higher in ['Control'] : 44. number of higher in ['Patient'] : 19. total 63
Let’s plot to see the behavior of these bacteria. The output of diff_abundance is an Experiment with only the significant bacteria, which are sorted by the effect size. On the bottom is the bacteria with the largest effect size (higher in Control compared to Patient).
In [13]:
dd.plot(sample_field='Subject', gui='jupyter')
Out[13]:
<calour.heatmap.plotgui_jupyter.PlotGUI_Jupyter at 0x1a0ffeca20>
We can ask what is special in the bacteria significanly higher in the Control vs. the Patient group and vice versa.
We supply the parameter ignore_exp=[12]
to ignore annotations
regarding this experiment (expid=12) since it is already in the dbBact
database.
In [14]:
ax, enriched=dd.plot_diff_abundance_enrichment(term_type='combined',ignore_exp=[12])
2018-03-04 12:40:41 INFO Getting dbBact annotations for 63 sequences, please wait...
2018-03-04 12:40:45 INFO Got 2383 annotations
2018-03-04 12:40:45 INFO Added annotation data to experiment. Total 631 annotations, 63 terms
2018-03-04 12:40:45 INFO removed 171 terms
The enriched terms are in a calour experiment class (terms are features, bacteria are samples), so we can see the list of enriched terms with the p-value (pval) and effect size (odif)
In [15]:
enriched.feature_metadata
Out[15]:
odif | pvals | term | |
---|---|---|---|
crohn's disease | -1.688995 | 0.000999 | crohn's disease |
**63**little physical activity | -1.368421 | 0.000999 | **63**little physical activity |
-control | -1.038876 | 0.000999 | -control |
age > 1 year | -1.014354 | 0.010989 | age > 1 year |
-small village | -0.997608 | 0.000999 | -small village |
age | -0.931873 | 0.060939 | age |
obsolete_juvenile stage | -0.898923 | 0.000999 | obsolete_juvenile stage |
child | -0.878190 | 0.006993 | child |
animal product diet | -0.825359 | 0.005994 | animal product diet |
-rural community | -0.803828 | 0.000999 | -rural community |
**63**-physical activity | -0.684211 | 0.000999 | **63**-physical activity |
higher in individuals with low physical activity ( high in little physical activity compared to physical activity in feces homo sapiens united states of america | -0.684211 | 0.000999 | higher in individuals with low physical activi... |
-adult | -0.674641 | 0.000999 | -adult |
**276**-tunapuco | -0.593301 | 0.000999 | **276**-tunapuco |
high in united states of america city state of oklahoma compared to peru small village tunapuco rural community in feces homo sapiens adult | -0.593301 | 0.000999 | high in united states of america city state o... |
**276**-peru | -0.593301 | 0.000999 | **276**-peru |
old age | -0.586124 | 0.007992 | old age |
high in children with Crohn's disease compared to healthy adult controls ( high in crohn's disease child obsolete_juvenile stage compared to control adult in feces homo sapiens glasgow | -0.578947 | 0.000999 | high in children with Crohn's disease compared... |
high in female compared to male in feces homo sapiens united states of america | -0.473684 | 0.000999 | high in female compared to male in feces ho... |
female | -0.466467 | 0.065934 | female |
infant | -0.452602 | 0.006993 | infant |
**131**age 30-50 years | -0.435407 | 0.025974 | **131**age 30-50 years |
canis lupus familiaris | -0.431685 | 0.000999 | canis lupus familiaris |
**150**enzyme supplement | -0.421053 | 0.006993 | **150**enzyme supplement |
state of oklahoma | -0.360048 | 0.008991 | state of oklahoma |
Higher in animal product diet compared to plant diet ( high in diet animal product diet compared to plant diet in feces homo sapiens united states of america | -0.352871 | 0.001998 | Higher in animal product diet compared to plan... |
**74**-plant diet | -0.352871 | 0.001998 | **74**-plant diet |
high in age 1 year compared to age 2 months in feces homo sapiens female infant state of california | -0.345694 | 0.001998 | high in age 1 year compared to age 2 months ... |
**284**-age 2 months | -0.345694 | 0.001998 | **284**-age 2 months |
**214**taconic farms | -0.315789 | 0.025974 | **214**taconic farms |
... | ... | ... | ... |
high in peru small village tunapuco rural community compared to united states of america city state of oklahoma in feces homo sapiens adult | 0.500000 | 0.000999 | high in peru small village tunapuco rural com... |
lower in small intestine compared to colon in pigs ( high in caecum left colon right colon compared to duodenum jejunum ileum in sus scrofa united kingdom pig | 0.500000 | 0.001998 | lower in small intestine compared to colon in ... |
**256**right colon | 0.543062 | 0.000999 | **256**right colon |
**256**left colon | 0.543062 | 0.000999 | **256**left colon |
lower in babies from finland compared to estonia ( high in estonia compared to finland in feces homo sapiens infant age < 3 years | 0.545455 | 0.000999 | lower in babies from finland compared to eston... |
**63**physical activity | 0.545455 | 0.017982 | **63**physical activity |
common caecum, left colon, right colon, sus scrofa, united kingdom, pig, | 0.561005 | 0.000999 | common caecum, left colon, right colon, sus s... |
common feces, homo sapiens, child, egypt, obsolete_juvenile stage, | 0.570574 | 0.000999 | common feces, homo sapiens, child, egypt, obs... |
**276**peru | 0.575758 | 0.000999 | **276**peru |
**276**tunapuco | 0.575758 | 0.000999 | **276**tunapuco |
-city | 0.590909 | 0.000999 | -city |
high in male compared to female in feces homo sapiens united states of america | 0.613636 | 0.000999 | high in male compared to female in feces ho... |
lower in wastewater plant effluent compared to influent and sewer in south america ( high in sewage influent compared to effluent in city wastewater treatment plant south america | 0.637560 | 0.000999 | lower in wastewater plant effluent compared to... |
-effluent | 0.637560 | 0.000999 | -effluent |
common feces, homo sapiens, venezuela, amerindian, hunter gatherer, | 0.674641 | 0.000999 | common feces, homo sapiens, venezuela, amerin... |
common feces, homo sapiens, adult, peru, small village, tunapuco, rural community, | 0.704545 | 0.000999 | common feces, homo sapiens, adult, peru, smal... |
rural community | 0.730463 | 0.000999 | rural community |
male | 0.735845 | 0.026973 | male |
-finland | 0.773325 | 0.000999 | -finland |
-united states of america | 0.795455 | 0.000999 | -united states of america |
common feces, homo sapiens, city, el salvador, small village, | 0.795455 | 0.000999 | common feces, homo sapiens, city, el salvador... |
-ileum | 0.811005 | 0.000999 | -ileum |
wet season | 0.909091 | 0.000999 | wet season |
-crohn's disease | 1.132775 | 0.000999 | -crohn's disease |
small village | 1.200159 | 0.000999 | small village |
adult | 1.248156 | 0.017982 | adult |
**53**influent | 1.275120 | 0.000999 | **53**influent |
**53**sewage | 1.275120 | 0.000999 | **53**sewage |
homo sapiens | 2.280233 | 0.068931 | homo sapiens |
control | 2.662679 | 0.011988 | control |
239 rows × 3 columns
In [ ]: