Phylogeny-based profiling
Overview
This protocol introduces how to perform metagenomic profiling using existing tools (e.g., Kraken, Centrifuge, SHOGUN), but instead of classifying sequences by taxonomy, it directly assigns sequences to tips or internal nodes of the reference phylogeny. This delivers higher resolution, more accurate assignment, and enables better intepretation of data in light of evolution.
Phylogenetic hierarchy
The practice of “metagenomic profiling”, or “classification” typically relies on the taxonomic hierarchy (i.e., phylum - class - order … species). It is overly simple in terms of describing the relationships among organisms, comparing with a phylogenetic tree, especially when the number of reference genomes is large.
Let us take genome G000008865 (Escherichia coli O157:H7 str. Sakai) for example. Its taxonomic lineage is:
k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia; s__Escherichia coli
Whereas its “phylogenetic lineage” according to our reference tree is:
N3;N7;N14;N41;N66;N101;N154;N234;N332;N452;N782;N971;N1163;N1390;N1641;N1913;N2521;N2839;N3180;N3530;N3891;N4259;N5003;N5851;N6294;N6723;N7150;N7546;N7949;N8335;N8680;N8963;N9223;N9443;N9629;N9782;N9907;N10016;N10109;N10189;N10254;N10316;N10378;N10427;N10459;N10486;N10507;N10528;N10546;G000008865
Many more levels, isn’t it?
Profiling with a phylogenetic tree will deliver much finer resolution than with the taxonomic hierarchy. Depending on one’s research scope, this could deliver new insights into the data.
Plus, taxonomy is known to be inaccurate and changeable (think of the example of Escherichia vs. Shigella, and that of Clostridium), whereas phylogeny is more accurate and better mathematically defined.
Fake taxonomy from tree
Most modern profilers use the taxonomic hierarchy to guide classification. Actually, for multiple programs, this “hierarchy” is flexible, and one can replace it with the phylogenetic lineages shown above so that the program will think “phylogenetically”.
We provide tree_to_taxonomy.py, which converts a phylogenetic tree into “fake taxonomy” files:
tree_to_taxonomy.py tree.nwk
- Optionally, this script allows the user to specify branch support and branch length cutoffs. Please see its command line inteface.
This script generates fake taxdump files (names.dmp
and nodes.dmp
), a genome ID to fake TaxID map (g2tid.txt
), and a fake lineage map (g2lineage.txt
).
One may also want to generate a nucleotide accession to fake TaxID map. This can be done using a Bash command:
join -12 -21 nucl2g.txt g2tid.txt -o1.1,2.2 -t$'\t' > nucl2tid.txt
- The file nucl2g.txt is provided in this repository.
Now let us see some example usages.
Kraken
Kraken (Wood and Salzberg, 2014) is a widely used taxonomic profiler for WGS data. It relies on the taxonomic hierarchy to determine the lowest common ancestor (LCA) of reference genomes where (k-mers of) the query sequence is mapped to.
Here we show how to replace the default taxonomic hierarchy (the NCBI taxdump) with our reference phylogeny. We will use Kraken 1.0 for example.
Database building
-
Specify a directory to host the Kraken database, say,
dbdir
. -
Create subdirectory
taxonomy
. Place the fake taxdump files (names.dmp
andnodes.dmp
) we generated above into it. -
Create subdirectory
library/added/
. Place the uncompressed genome sequences (e.g.,G000123456.fna
) into this directory. -
In
library/added/
, create a fileprelim_map.txt
. Its content is like:
TAXID <tab> NC_000001.1 <tab> 12345
TAXID <tab> NC_000002.1 <tab> 6789
...
This can be done using a simple Bash command, from the file nucl2tid.txt
we already generated above:
sed -e 's/^/TAXID\t/' nucl2tid.txt > library/added/prelim_map.txt
- Now build the Kraken database:
kraken-build --build --db dbdir --threads 32
- Here
32
is the number of CPUs equipped in your system. Please customize.
- Finally, clean up the database directory.
kraken-build --clean
Profiling
The commands are exactly the same as one does with a regular (original) Kraken database:
kraken --db dbdir --threads 32 --paired R1.fq.gz R2.fq.gz --output output.tsv 2> log.txt
kraken-report --db dbdir output.tsv > output.report
Now take a look at output.report
. You will see a list of number of sequences and relative abundances of tip (starting with G
) and internal nodes (starting with N
).
Here is a real example. We took the low-complexity toy dataset (S_S001
) from the 1st CAMI challenge (Sczyrba et al., 2017). If one profiles it with the original NCBI taxonomy, the report file looks like:
0.00 0 0 U 0 unclassified
100.00 40334552 0 - 1 root
100.00 40334552 40829 - 131567 cellular organisms
94.29 38031114 131753 D 2 Bacteria
48.16 19423204 7711 - 1783272 Terrabacteria group
34.64 13973035 2274120 P 1239 Firmicutes
18.88 7616977 3106 C 91061 Bacilli
18.71 7547478 5791 O 1385 Bacillales
18.20 7340520 1046 F 186822 Paenibacillaceae
18.19 7335590 1066307 G 44249 Paenibacillus
...
With the phylogenetic hierarchy, it is like:
0.00 0 0 U 0 unclassified
100.00 40334552 40830 - 1 N1
94.29 38031111 4043 - 3 N3
94.25 38016970 808 - 7 N7
94.25 38016043 89756 - 14 N14
48.16 19423150 7382 - 39 N39
34.64 13972410 32 - 226 N226
34.64 13971873 875 - 321 N321
34.58 13949620 138821 - 587 N587
32.83 13241176 0 - 754 N754
...
One can convert this file into a plain node ID to relative abundance map:
cat S_S001.report | grep -v $'unclassified$' | awk -v OFS='\t' '{print $NF, $1}' > S_S001.txt
If one wants to set a lower bound for the relative abundance (say, 0.1%), add:
... | awk '$2>=0.1' > S_S001.txt
Centrifuge
Centrifuge (Kim et al., 2016) is a metagenomic classifier which features the compression of closely related reference genomes. The classification step is also guided by the taxonomic hierarchy. Similarily, we can replace the taxonomy with phylogeny.
Build database:
centrifuge_build --seed 42 --threads 32 --conversion-table nucl2tid.txt --taxonomy-tree nodes.dmp --name-table names.dmp nucl.fna dbname
Run search:
centrifuge --seed 42 -p 32 -1 R1.fq.gz -2 R2.fq.gz -x dbname -S output.map --report-file output.report
Generate a Kraken-style report we explained above:
centrifuge-kreport -x dbname output.tsv > output.kreport
SHOGUN - UTree
SHOGUN (Hillmann et al., 2018) is a novel metagenomics pipeline developed by our team. It features the accurate handling of shallow shotgun sequencing data. With as few as 0.5 million sequences per sample (thus cost is very low as comparable to 16S rRNA sequencing), one can obtain decent classification and diversity analysis results that are comparable to the outcome of deep sequencing.
SHOGUN provides three classification tools for user choice: Bowtie2, UTree and BURST. Bowtie2 and BURST are pure sequence aligners, and taxonomic classification step is performed after the map is obtained. Therefore they can take advantage of the curated taxonomy (see genome database for how) but not the phylogenetic tree itself.
UTree is a k-mer profile & taxonomic hierarechy-guided classifier. In this sense it is similar to Kraken, but it is significantly faster and consumes significantly less computational resource (especially memory).
The support for phylogeny instead of taxonomy is built-in in UTree.
Build database:
utree-build_gg reference.fna nucl2tid.txt temp.ubt 0 2
utree-compress temp.ubt dbname.ctr
rm temp.ubt temp.ubt.gg.log
Run search by directly calling UTree:
utree-search_gg dbname.ctr input.fa output.txt 32 RC
Or via the SHOGUN interface:
shogun align -t 32 -d dbdir -a utree -i input.fa -o .
UTree generates a mapping file, in which each query sequence is directly assigned to a taxonomic lineage (here phylogenetic clade).
Visualization
We developed Empress, an interactive tree viewer that renders trees with 1M+ tips on the fly. Here is a live version showing the current reference phylogeny. It allows one to upload and visualize the profiling result as color gradient on the tree.
In the “Colors” tab of the interface, upload the node ID-to-abundance file we discussed above (S_S001.txt
). In the tip/node color dropdown, there will be an extra option “S_S001”. Select it.
Here is a tree-shaped heatmap of your microbiota:
The root of the tree has the deepest color, indicating that at this level, the relative abundance is close to 100%. Starting from the root, the color depth decreases and diversifies. Some lineages are darker than the background. They are what the microbial community is composed of.
One can move the mouse cursor over those dark nodes to see their IDs and taxonomic annotations (using either NCBI or GTDB). One can collapse the tree at certain taxonomic rank in the “Collapse” tab to get a cleaner view.