- What it is(n’t)
- How it was made
- How to get it
- How to view it
- How to use it in research
- Information for users of…
- How to cite it
What it is (and isn’t)
We present a reference phylogenetic tree (or more precisely, multiple trees, depending on your choice) for bacterial and archaeal genomes that are publicly available from NCBI RefSeq and GenBank. It is meant to serve as a reference for researchers exploring the evolution and diversity of microbes, and to improve the study of microbial communities.
How it was made
For comparison, we also generated multiple trees using the conventional gene alignment concatenation strategy, and using multiple alternative genome and gene sampling rules. Detailed protocols are provided.
How to get it
Multiple trees, built using different input data and methodology, together with the corresponding metadata, curated taxonomy and other information, are provided in this repository. Please browse the data directory for details.
The genome and protein sequences, multiple sequence alignments and other large data files are available at Globus, with endpoint name WebOfLife (owner: firstname.lastname@example.org). For instructions on how to transfer files via Globus, please read this guide.
How to view it
We present an interactive visualization of the tree. You can zoom, collapse, label, and color the tree. Mouse over individual tips or nodes to view their taxonomy (NCBI or GTDB), navigate to external databases, or export download links or subtrees.
We also provide high-resolution PDF images in multiple layouts and collapsed at multiple ranks, and their FigTree and iTOL-ready rendering packages, as well as the protocol and source code for rendering, at gallery.
Alternatively, you can always start with the raw Newick files, and metadata of taxa and nodes provided at data to build your own view!
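As a starting point for building your own view, here is a minimal sketch of reading a Newick string with the Python standard library only. The toy tree and the function names (`parse_newick`, `tip_names`) are illustrative assumptions, not part of this repository. The parser handles only unquoted labels and branch lengths, and it recurses once per nesting level, so for the full trees a dedicated phylogenetics library is likely a better choice.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts.

    Supports unquoted labels and branch lengths only; quoted labels and
    bracketed comments are outside the scope of this sketch.
    """
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def read_node():
        nonlocal pos
        node = {"name": "", "length": None, "children": []}
        if text[pos] == "(":
            pos += 1  # consume "("
            while True:
                node["children"].append(read_node())
                if text[pos] == ",":
                    pos += 1  # another sibling follows
                elif text[pos] == ")":
                    pos += 1  # end of this clade
                    break
                else:
                    raise ValueError(f"Unexpected character at position {pos}")
        # Optional label, read up to the next delimiter.
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        node["name"] = text[start:pos]
        # Optional branch length after ":".
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            node["length"] = float(text[start:pos])
        return node

    root = read_node()
    if text[pos] != ";":
        raise ValueError(f"Unexpected character at position {pos}")
    return root


def tip_names(node):
    """Return the names of all tips below a node, in left-to-right order."""
    if not node["children"]:
        return [node["name"]]
    names = []
    for child in node["children"]:
        names.extend(tip_names(child))
    return names


# Toy tree (made up for illustration; not one of the released trees).
toy = parse_newick("((A:1.0,B:2.0)N1:0.5,C:3.0)root;")
print(tip_names(toy))
```

The nested dicts can then be joined with the taxon and node metadata by name before rendering.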
How to use it in research
Beyond visual inspection, you can use the reference phylogeny in your research to extend the understanding of the composition and diversity of microbial communities.
Genome and taxonomy database
The 10,575-genome catalog, with its curated taxonomy, can be compiled into a reference genome database, and plugged into your existing analysis workflow (e.g., for metagenomic profiling). See this protocol.
Microbial community ecology
This reference phylogeny enables classical diversity analyses designed during the 16S rRNA era, such as UniFrac for beta diversity, and Faith’s PD for alpha diversity, on WGS datasets. Finer-grained output is enabled at per-genome level resolution (we call it “gOTU”). See this protocol and corresponding source code.
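To illustrate what these two metrics measure, here is a minimal sketch on a toy rooted tree, using only the Python standard library. The tree, the genome names and the function names are assumptions made for this example, and this is not the implementation used in the protocol. Faith's PD is computed as the total branch length connecting the observed tips to the root. Unweighted UniFrac is computed as the fraction of covered branch length that is unique to one of the two samples.

```python
# Toy rooted tree: child -> (parent, branch length).
# Names and lengths are made up for illustration.
TREE = {
    "A": ("N1", 1.0),
    "B": ("N1", 2.0),
    "N1": ("root", 0.5),
    "C": ("root", 3.0),
}


def branches_to_root(tip, tree):
    """Return the branches (named by their child node) from a tip to the root."""
    branches = set()
    node = tip
    while node in tree:
        branches.add(node)
        node = tree[node][0]
    return branches


def covered_branches(tips, tree):
    """Return all branches on the paths from the given tips to the root."""
    covered = set()
    for tip in tips:
        covered |= branches_to_root(tip, tree)
    return covered


def faith_pd(tips, tree):
    """Faith's PD: total length of branches covered by the observed tips."""
    return sum(tree[b][1] for b in covered_branches(tips, tree))


def unweighted_unifrac(tips1, tips2, tree):
    """Unweighted UniFrac: unique branch length over total covered length."""
    b1 = covered_branches(tips1, tree)
    b2 = covered_branches(tips2, tree)
    union = b1 | b2
    if not union:
        return 0.0
    unique = b1 ^ b2  # branches covered by exactly one sample
    return sum(tree[b][1] for b in unique) / sum(tree[b][1] for b in union)


print(faith_pd({"A", "B"}, TREE))                   # 1.0 + 2.0 + 0.5 = 3.5
print(unweighted_unifrac({"A", "B"}, {"C"}, TREE))  # no shared branches: 1.0
```

In a gOTU-style analysis, the tips would be the genomes detected in each sample.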
We present a novel metagenomic profiling strategy, which solely relies on phylogeny, and NOT taxonomy, to enable higher-resolution and more accurate classification, and new insights in light of evolution. WGS data are directly assigned to internal nodes of the tree, and can be visualized in our interface. See this protocol and corresponding source code.
Information for users of
IDs of our genome pool are directly translated from NCBI assembly accessions. Duplicate genomes are merged. Copies of genome sequences are hosted at our Globus endpoint. Instructions are provided to download genomes fresh from the original NCBI server. See details.
Mappings to GTDB genome IDs are provided in the genome metadata. In the current release, 9,732 (92.03%) of the 10,575 genomes have corresponding GTDB IDs. Annotation (and curation) of our tree using GTDB taxonomy is provided. The relative evolutionary divergence (RED) (Parks et al., 2018) of tree nodes (which are mapped to taxonomic groups) is also provided.
Mappings to IMG genome/taxon IDs are provided in the genome metadata. In the current release, 6,758 (63.91%) of the 10,575 genomes have corresponding IMG IDs.
The reference tree can be used for the diversity analysis of shotgun metagenomes, using phylogeny-aware algorithms such as UniFrac for beta diversity, and Faith’s PD for alpha diversity. See this protocol.
A derivative for 16S rRNA-based analysis is under development. Please stay tuned.
The WoL database has been implemented in Qiita. Users can analyze shotgun metagenomic data using WoL from the graphical user interface: start from FASTQ files, choose the command “Shogun 1.0.7”, then choose the optional parameter “wol_xxx” (xxx is the aligner of choice).
The 381 marker genes used to build the tree are a curated subsample of the 400 marker genes originally implemented in PhyloPhlAn. For each marker gene, we provide a functional annotation, a gene tree, and its degree of congruence with species evolution. Please see data/markers and data/trees/genes.
Please also check out PhyloPhlAn2.
Kraken / Centrifuge
Two things can be done for each program: (basic) The genome pool and a curated taxonomy can be compiled into an improved reference genome database for metagenomic profiling. See this protocol for details.
(advanced) The reference phylogeny can replace the NCBI taxonomy hierarchy used in a Kraken / Centrifuge analysis to guide the classification process. Query sequences are directly assigned to nodes instead of taxonomic ranks. See this protocol.
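To make the idea of assigning queries to tree nodes concrete, here is a minimal sketch of a lowest common ancestor (LCA) lookup on a toy tree. Kraken-style classifiers resolve a sequence that matches several genomes to the LCA of those genomes. When the hierarchy is a phylogeny, that LCA is an internal node of the tree rather than a taxonomic rank. The tree, the node names and the function names below are assumptions made for illustration; this is not the code from the protocol.

```python
# Toy hierarchy: child -> parent. Names are made up for illustration.
PARENT = {
    "G1": "N2",
    "G2": "N2",
    "N2": "N1",
    "G3": "N1",
    "N1": "root",
    "G4": "root",
}


def path_to_root(node, parent):
    """Return the list of nodes from the given node up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lowest_common_ancestor(nodes, parent):
    """Return the deepest node that is an ancestor of (or equal to) every input."""
    nodes = list(nodes)
    if not nodes:
        raise ValueError("At least one node is required")
    # Candidates are the ancestors of the first node, deepest first.
    first_path = path_to_root(nodes[0], parent)
    common = set(first_path)
    for node in nodes[1:]:
        common &= set(path_to_root(node, parent))
    for candidate in first_path:
        if candidate in common:
            return candidate
    raise ValueError("Nodes do not share a common ancestor")


# A query matching both G1 and G3 is assigned to the internal node N1.
print(lowest_common_ancestor(["G1", "G3"], PARENT))
```

A query that matches a single genome stays at that tip, while one that matches distant genomes moves up towards the root.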
The genome pool, the curated taxonomy and the phylogenetic tree itself can be compiled into a reference database to improve metagenomic profiling. The intermediate files can be further used for community ecology analysis. See these protocols: 1, 2 and 3.
We will integrate the reference phylogeny with TIPP, a phylogenetic placement-based metagenomic sequence classifier. Please stay tuned.
If you use the data, code or protocols developed in this work, please cite:
Zhu Q*, Mai U*, Pfeiffer W, Janssen S, Asnicar F, Sanders JG, Belda-Ferre P, Al-Ghalith GA, Kopylova E, McDonald D, Kosciolek T, Yin JB, Huang S, Salam N, Jiao J, Wu Z, Xu ZZ, Sayyari E, Morton JT, Podell S, Knights D, Li W, Huttenhower C, Segata N, Smarr L, Mirarab S, Knight R. Phylogenomics of 10,575 genomes reveals evolutionary proximity between domains Bacteria and Archaea. Nature Communications. 2019. 10(1):5477. doi: 10.1038/s41467-019-13443-4.
- National Science Foundation (NSF) grant 1565057
- Alfred P. Sloan Foundation grant G-2017-9838
- NSF Extreme Science and Engineering Discovery Environment (XSEDE) allocation BIO150043