Alignments

These are the multiple sequence alignments (MSAs) used to build the phylogenetic trees.

  • genes: Full-length alignments per gene.

Concatenated alignments, with maps of duplicate sequences (download from the GitHub directory):

  • cons: 381 marker genes, up to the 100 most conserved sites per gene, selected using the “trident” algorithm (Valdar, 2002) as implemented in PhyloPhlAn.
  • rand: 381 marker genes, 100 sites per gene, randomly selected from sites with less than 50% gaps.
  • rpls: 30 ribosomal proteins, identified using PhyloSift (for comparative purposes only).
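
The site selection described for rand can be illustrated with a short Python sketch. It is not the code used to produce these files: it assumes that gaps are written as "-", that every sequence has the same length, and that fewer sites are returned when fewer than the requested number are eligible. The names gap_fraction and sample_sites and the seed parameter are hypothetical.

```python
import random


def gap_fraction(column):
    """Return the fraction of gap characters ('-') in one alignment column."""
    return column.count("-") / len(column)


def sample_sites(seqs, n_sites=100, max_gap=0.5, seed=42):
    """Randomly pick up to n_sites columns whose gap fraction is below max_gap.

    seqs maps taxon name -> aligned sequence (all of equal length).
    Returns a new dict with only the chosen columns, in their original order.
    """
    names = list(seqs)
    length = len(seqs[names[0]])
    # Keep only columns with less than max_gap gaps (assumption: '-' marks a gap).
    eligible = [
        i for i in range(length)
        if gap_fraction([seqs[n][i] for n in names]) < max_gap
    ]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    chosen = sorted(rng.sample(eligible, min(n_sites, len(eligible))))
    return {n: "".join(seqs[n][i] for i in chosen) for n in names}
```

For a per-gene workflow, one would call sample_sites on each gene alignment and then concatenate the results per taxon; that step is omitted here.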

Notes:

  • MSA files (*.xz) are compressed with LZMA to save disk space. Most Linux systems support this format natively. macOS users can install xz with Homebrew to add support for it.

  • These MSA files are already de-duplicated: when the dataset contains multiple identical sequences, only one is retained. We also provide mapping files (*.map) that link each retained sequence to its identical duplicates, and a Python script, append_taxa.py, that appends those additional taxa to a phylogenetic tree.
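
As a rough illustration of the two notes above, the following Python sketch reads an LZMA-compressed alignment with the standard lzma module and shows one way identical sequences could be collapsed. It makes two assumptions: that the alignment is stored in FASTA format, and that a duplicate map can be modeled as a dictionary from each retained taxon to its identical taxa. The actual layout of the *.map files is not described here, and this sketch does not reproduce append_taxa.py. The function names read_fasta_xz and deduplicate are hypothetical.

```python
import lzma


def read_fasta_xz(path):
    """Read an xz-compressed FASTA file (assumed format) into {name: sequence}."""
    chunks = {}
    name = None
    # lzma.open in text mode decompresses on the fly; no temporary file is needed.
    with lzma.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:].split()[0]  # keep the first word of the header
                chunks[name] = []
            elif name is not None:
                chunks[name].append(line)
    return {n: "".join(parts) for n, parts in chunks.items()}


def deduplicate(seqs):
    """Collapse identical sequences, keeping the first taxon seen for each.

    Returns (unique, dup_map): unique maps retained taxon -> sequence, and
    dup_map maps retained taxon -> list of taxa with an identical sequence.
    """
    first_seen = {}  # sequence -> name of the retained taxon
    unique = {}
    dup_map = {}
    for name, seq in seqs.items():
        if seq in first_seen:
            dup_map.setdefault(first_seen[seq], []).append(name)
        else:
            first_seen[seq] = name
            unique[name] = seq
    return unique, dup_map
```

Reading with lzma.open avoids decompressing the files to disk first, which keeps the disk savings mentioned above.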

MSAs are also available from our Globus endpoint: WebOfLife.