Alignments
The multiple sequence alignments (MSAs) used for building phylogenetic trees.
- genes: Full-length alignments per gene.
Concatenated alignments and maps of duplicates (download from the GitHub directory):
- cons: 381 marker genes, up to the 100 most conserved sites per gene, selected using the “trident” algorithm (Valdar, 2002) as implemented in PhyloPhlAn.
- rand: 381 marker genes, 100 sites per gene, randomly selected from sites with less than 50% gaps.
- rpls: 30 ribosomal proteins, identified using PhyloSift (for comparative purposes only).
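The site selection described for the rand alignment can be illustrated with a minimal sketch. This is not the code used to build the alignments. The function names, the dict-of-strings alignment representation, and the use of "-" as the only gap character are assumptions made for illustration.

```python
import random


def select_random_sites(alignment, n_sites=100, max_gap_frac=0.5,
                        gap_chars="-", seed=None):
    """Pick up to n_sites column indices at random.

    Only columns whose gap fraction is strictly below max_gap_frac
    are eligible. `alignment` is a dict of taxon name -> aligned
    sequence (a hypothetical in-memory representation).
    """
    seqs = list(alignment.values())
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    n = len(seqs)
    # Keep columns where fewer than max_gap_frac of the taxa have a gap.
    eligible = [
        i for i in range(length)
        if sum(s[i] in gap_chars for s in seqs) / n < max_gap_frac
    ]
    rng = random.Random(seed)
    k = min(n_sites, len(eligible))
    # Sort so the sampled columns keep their original order.
    return sorted(rng.sample(eligible, k))


def extract_sites(alignment, sites):
    """Return a new alignment containing only the given columns."""
    return {name: "".join(seq[i] for i in sites)
            for name, seq in alignment.items()}
```

Applied gene by gene, with the per-gene results joined per taxon, this yields a concatenated alignment of the same general shape as rand. Passing a fixed `seed` makes the selection reproducible.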
Notes:
- MSA files (with the extension *.xz) were compressed using LZMA to minimize disk space consumption. This format is natively supported on most Linux systems. macOS users may install xz using Homebrew to gain support for it.
- These MSA files are already de-duplicated, i.e., if there are multiple identical sequences in the dataset, only one is retained. In addition, we provide mapping files (*.map) from the retained sequences to their duplicates, and a Python script append_taxa.py to append those additional taxa to a phylogenetic tree.
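For readers who prefer not to decompress the files by hand, the sketch below reads an LZMA-compressed alignment directly with Python's standard `lzma` module and illustrates what de-duplication means. It assumes the MSA is in FASTA format, which this README does not state. The function names are hypothetical, and the returned dictionary does not mirror the layout of the provided *.map files, nor does it replace append_taxa.py.

```python
import lzma


def read_fasta_xz(path):
    """Read an .xz-compressed FASTA file into a dict: name -> sequence.

    Assumption: the alignment is FASTA-formatted; adapt the parsing
    if the files use another format.
    """
    chunks = {}
    name = None
    # "rt" opens the compressed stream in text mode.
    with lzma.open(path, "rt") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:].strip()
                chunks[name] = []
            elif name is not None:
                chunks[name].append(line)
    return {k: "".join(v) for k, v in chunks.items()}


def deduplicate(seqs):
    """Keep the first of each set of identical sequences.

    Returns (kept, dup_map), where dup_map maps each retained name
    to the names of the identical sequences that were dropped.
    """
    kept = {}
    dup_map = {}
    first_seen = {}  # sequence -> name of the retained taxon
    for name, seq in seqs.items():
        if seq in first_seen:
            dup_map.setdefault(first_seen[seq], []).append(name)
        else:
            first_seen[seq] = name
            kept[name] = seq
    return kept, dup_map
```

Since the distributed files are already de-duplicated, `deduplicate` should find nothing to remove in them. It is shown only to make the relationship between a retained sequence and its duplicates concrete.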
MSAs are also available from our Globus endpoint: WebOfLife.