Microbial community ecology
Overview
Here we introduce that, by using the current genome catalog and reference phylogeny, one can analyze whole-genome shotgun (WGS) sequencing data (a.k.a., metagenomic data) using classical bioinformatic methods designed for amplicon (e.g., 16S rRNA) sequencing data, such as UniFrac for beta diversity, and Faith’s PD for alpha diversity, and more.
With an existing sequence aligner (e.g., Bowtie2 or BLAST), and a Python script developed by us, one can generate a BIOM table, in which every row is an individual genome (here we refer to it as “OGU”), and every column is a sample (microbiome). This BIOM table can be analyzed using a familiar bioinformatic pipeline (e.g., QIIME2) and familiar functions.
The entire analysis is taxonomy-free, although one can always get taxonomic information in parallel (see genome database).
Core notion: OGU
The term OTU (operational taxonomic unit) was conventionally used in 16S data analysis. In practice, sequences are clustered at a similarity threshold of 97%, and each cluster is considered a basic unit of the community. Recent years have seen this classical term evolving into sOTU, where every exact 16S sequence is treated as the basic unit (i.e., an sOTU), hence improving resolution.
sOTU analysis using Deblur or DADA2 is well-supported in QIIME2. See details.
The notion we introduce here, OGU, means to be an analogue to sOTU, but for WGS data analysis. Previously, diversity analyses on WGS data were typically performed at a particular taxonomic rank, such as genus or species. This limits resolution, and introduces artifacts due to the limitation of taxonomic assignment itself. In this protocol, we do NOT assign taxonomy, but directly consider individual sequence-genome association as the basic unit of the microbiome.
Sequence mapping
Any sequence aligner, such as Bowtie2 or BLAST, or more complicated metagenome classifier, such as SHOGUN or Centrifuge, as long as it directly associates query sequences with reference genome sequences, can be used for this analysis. For the later, this process does not block the original analysis (i.e., taxonomic classication). It just makes use of the intermediate files (read maps)
(Evaluation of individual aligners in the context of OGU analysis is currently ongoing.)
Example 1: Taxonomic profiling using SHOGUN, with Bowtie2 as aligner:
shogun align -t 32 -d /path/to/db -a bowtie2 -i input.fa -o .
Example 2: Taxonomic profiling using Centrifuge:
centrifuge -p 32 -x /path/to/db -1 input.R1.fq.gz -2 input.R2.fq.gz -S output.map
OGU table generation
OGU from maps
We provide a Python script ogu_from_maps.py to convert mapping files of multiple samples into a BIOM table (i.e., the OGU table). This script supports multiple mapping file formats, including:
- SAM format (used by Bowtie2, BWA, etc.),
- BLAST tabular output (a.k.a.,
m8
, used by BLAST, USEARCH, BURST, etc.), - Centrifuge mapping file, and
- Simple “query <tab> subject” map.
Place all mapping files in one directory. The stem filenames represent sample IDs. Compressed files (.gz
, .bz2
, .xz
etc.) are acceptable. Then execute the script.
Example 1: SHOGUN by Bowtie2 maps:
ogu_from_maps.py bowtie2_result_dir output -m bowtie2 -e .sam.bz2 -t nucl2g.txt
- The
nucl2g.txt
is a map of genome sequences (nucleotide) to genome IDs. We provide this file in this repository. One may also customize it. .sam.bz2
is the extension filename of each mapping file (SAM format). We assume that it was already compressed usingbzip2
.
Example 2: Centrifuge maps:
ogu_from_maps.py centrifuge_result_dir output -m centrifuge -e .map.xz -t nucl2g.txt
The script generates three OGU tables. They differ by the way non-unique hits are treated:
all.tsv
: Includes all hits to each genome, regardless of ambiguity.uniq.tsv
: Only considers unique hits per genome (i.e., query sequences simultaneously mapped to multiple genomes are not considered).norm.tsv
: When one query sequence is mapped to k genomes, each genome receives 1 / k hit.
Either one can be used for the downstream analyses. Choice depends on specific research goal and experimental design. Let’s use uniq.tsv
as an example hereafter.
Data formatting
One can further convert the .tsv file into the BIOM format, which is the standard for microbiome studies and broader bioinformatics.
biom convert -i table.tsv -o table.biom --table-type="OTU table" --to-hdf5
To work with BIOM format one needs the Python package biom-format
. Multiple bioinformatics packages such as QIIME2 already include it.
Data refining
A microbiome dataset usually needs to be refined in order to obtain optimal results. Please refer to QIIME2 tutorials for data filtering and rarefaction.
In addition, we provide filter_otus_per_sample.py, which allows filtering by threshold at a per-sample base, which is useful in some cases, especially when false positive mapping is a concern. For example:
filter_otus_per_sample.py input.biom 0.0001 output.biom
- This command filters out OGU with less than 0.01% assignments per sample.
OGU analysis using QIIME2
Importing data
We recommend using QIIME2 to analyze microbiome datasets. To do so, one needs to convert the BIOM table into a QIIME2 artifact:
qiime tools import --type FeatureTable[Frequency] --input-path table.biom --output-path table.qza
The reference phylogeny provided in the also needs to be imported into QIIME2 as an artifact:
qiime tools import --type Phylogeny[Rooted] --input-path tree.nwk --output-path tree.qza
Lazy person’s all-in-one solution
Here you go:
qiime diversity core-metrics-phylogenetic \
--i-phylogeny tree.qza \
--i-table table.qza \
--p-sampling-depth 1000 \
--m-metadata-file metadata.tsv \
--output-dir .
And enjoy the output!
This might be overly simple (and the sampling depth 1,000 is arbitrary). We would rather you spend bit more time reading and understanding the analyses under the hood. QIIME2’s “Moving Pictures” tutorial is a good starting point. We also list individual relevant analyses below.
Alpha diversity analysis using Faith’s PD
Alpha diversity describes the microbial diversity within each community. Faith’s PD (Faith, 1992) is an alpha diversity metric that incorporates phylogenetic distances (i.e., branch lengths) in the equation. It can be calculated using QIIME2’ alpha-phylogenetic command:
qiime diversity alpha-phylogenetic \
--i-phylogeny tree.qza \
--i-table table.qza \
--p-metric faith_pd \
--output-dir .
Beta diversity analysis using UniFrac
Beta diversity describes the microbial diversity across different communities. UniFrac (Lozupone and Knight, 2006) is a group of beta diversity metrics that concern the phylogenetic distances among OTUs. Recently, we improved the implementation of UniFrac (McDonald et al., 2018), allowing efficient analysis of very large datasets (e.g., 100k+ samples). These are provided by QIIME2’s beta-phylogenetic command.
qiime diversity beta-phylogenetic \
--i-phylogeny tree.qza \
--i-table table.qza \
--p-metric weighted_unifrac \
--output-dir .
- Here we used the weighted UniFrac metric as an example, which considers the relative abundances of OGUs. It is usually encouranged to test other metrics too and compare the results.
The beta diversity analysis generates a distance matrix among samples, on which multiple downstream analyses can be performed. Examples are PCoA and subsequent visualization, PERMANOVA, Mantel test, kNN classification. We encourage you to explore the QIIME2 documentation and workshops to find out more!