Protocols for building, analyzing, and using the trees and other resources in this project.


  • Genome retrieval: Download all bacterial and archaeal genomes available from NCBI GenBank and RefSeq, using RepoPhlAn.

  • Genome sampling: Select n genomes from a genome pool such that they maximize the included biodiversity, as measured by the k-mer signatures of the genomes.

  • Marker identification: Identify and extract amino acid sequences of 400 global marker genes from genomes, using PhyloPhlAn.

  • Tree building: Build phylogenetic trees of genes and species using various approaches.

  • Tree manipulation: Manipulate phylogenetic trees using the Python scripts developed by our team.

  • Taxon subsampling: Select n taxa from a larger phylogenetic tree such that they maximize the representation of deep-branching, large clades.

  • Taxonomy curation: Evaluate, modify and extend existing taxonomic assignments based on a phylogenetic tree.
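
To make the idea behind the genome sampling step concrete, here is a minimal sketch of one way to pick n genomes that are spread out in k-mer space. It is an illustration only, not the project's protocol: the greedy max-min (farthest-point) heuristic, the Euclidean distance, the choice of k = 2, and the function names kmer_profile and sample_diverse are all assumptions introduced for this example.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k=2):
    """Return the normalized k-mer frequency vector of a DNA sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip k-mers containing ambiguous bases
            counts[kmer] += 1
    total = sum(counts.values()) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sample_diverse(profiles, n):
    """Greedily pick n genome names so each new pick is as far as
    possible from the genomes already chosen (hypothetical heuristic)."""
    names = sorted(profiles)
    if n >= len(names):
        return names
    chosen = [names[0]]  # arbitrary but deterministic seed
    # Distance from every genome to its nearest chosen genome.
    nearest = {g: euclidean(profiles[g], profiles[names[0]]) for g in names}
    while len(chosen) < n:
        candidates = (g for g in names if g not in chosen)
        pick = max(candidates, key=lambda g: nearest[g])
        chosen.append(pick)
        for g in names:
            nearest[g] = min(nearest[g], euclidean(profiles[g], profiles[pick]))
    return chosen


if __name__ == "__main__":
    toy = {"g1": "AAAAAAAA", "g2": "AAAAAAAT", "g3": "CCCCCCCC", "g4": "GGGGGGGG"}
    profiles = {name: kmer_profile(seq) for name, seq in toy.items()}
    print(sample_diverse(profiles, 3))
```

In the toy input, g2 is nearly identical to g1 in k-mer composition, so a selection of three genomes skips it in favor of the more distinct g3 and g4. Real genome pools would call for larger k and a more scalable distance computation.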


  • Tree comparison: Compare the phylogenetic relationships and distances indicated by individual species trees.

  • Tree comparison by depth: Compare the topologies of two trees with consideration of phylogenetic depth.

  • Major clade dimension: Calculate and compare the dimensions of major clades (e.g., Archaea vs. Bacteria), including distances between crown groups and distances between leaves.

  • Shared clades: Collapse two very large trees to a shared set of large clades to enable back-to-back comparison via tanglegram.

  • Gene tree discordance: Analyze evolutionary discrepancy reflected by individual gene trees.

  • Saturation test: Analyze potential amino acid substitution saturation and how it impacts estimated phylogenetic distances.

  • GTDB translation: Process GTDB taxonomy and trees to enable cross-translation with our work.
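
As a small illustration of what comparing tree topologies can involve, the sketch below counts the clades found in one rooted tree but not the other, a rooted (clade-based) variant of the Robinson-Foulds distance. This is a generic textbook metric shown under stated assumptions, not necessarily the metric used in these protocols: the nested-tuple tree encoding and the function names clades and rooted_rf are hypothetical.

```python
def clades(tree):
    """Return (all leaves, non-trivial clades) for a rooted tree.

    A tree is a nested tuple whose leaves are strings,
    e.g. (("A", "B"), ("C", "D")).
    """
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)  # record the leaf set under this internal node
        return leaves

    all_leaves = walk(tree)
    found.discard(all_leaves)  # the root clade is shared by definition
    return all_leaves, found


def rooted_rf(tree_a, tree_b):
    """Count clades present in exactly one of the two trees."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must share the same leaf set")
    return len(clades_a ^ clades_b)


if __name__ == "__main__":
    balanced = (("A", "B"), ("C", "D"))
    ladder = ((("A", "B"), "C"), "D")
    print(rooted_rf(balanced, ladder))
```

A distance of 0 means the two trees contain the same clades. The count ignores branch lengths, so it says nothing about the phylogenetic distances that several of the analyses above also compare.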


  • Tree rendering: Collapse tree at given rank(s) and generate files ready for iTOL and FigTree rendering.


  • Genome database: Build a reference genome database with phylogeny-curated taxonomy to improve an existing metagenomic sequence classification workflow.

  • Community ecology: Convert WGS sequence alignments into a “gOTU table” and perform microbial community ecology analyses with the reference phylogeny.

  • Tree profiling: Modify an existing metagenomic profiling workflow to allow sequences to be directly assigned to tips and internal nodes of the reference phylogeny.