Tree building
Core protocol of this project. Phylogentic reconstruction of evolutionary relationships of individual gene families and whole genomes.
This protocol is fully programmed, without the need for any manual curation. Although in the actual analysis one needs to manually kick-start individual computational tasks, since they are so heavy with this amount of data and have to be scheduled with care on a super cluster (in our case, SDSC Comet).
1. Tools
- UPP 2.0
- FastTree 2.1.9
- RAxML 8.2.10
- IQ-TREE 1.6.1
- PhyloPhlAn 2c0e61a
- TreeShrink 1.1
- ASTRAL 5.12.6a
Note: the commands listed below have ignored multi-threading parameters and filename-specific parameters.
2. Sequence alignment
2.1. Per-gene alignment
Align amino acid sequences using UPP.
run_upp.py -s seqfile.fa -B 100000 -M -1 -T 0.66 -m amino
- UPP was run with all sequences but fragments are used for the backbone. Sequences with length L < 0.34*M or L > 1.33*M, where M is the median length of all the sequences, were detected as fragments and not included in the backbone.
After aligned the sequences by UPP, we filtered out sites with more than 95% gaps, then filtered out sequences with more than 66% gaps.
2.2 Gene filtering
Drop marker genes with more than 75% gaps in the alignment matrix.
This left 381 out of 400 marker genes, and left 10,575 out of 11,079 genomes.
3. Gene trees
3.1. Model selection
We used ProteinModelSelection.pl
as bundled in RAxML:
perl ProteinModelSelection.pl align.fa > best_model.txt
3.2. Tree building
Build a starting tree using FastTree:
FastTree -lg -gamma -seed 12345 align.fa > fasttree.nwk
Remove outliers (low quality sequences, contaminations, etc.) presented as unproportionally long branches in the FastTree trees using TreeShrink:
run_treeshrink.py -i input_dir -t fasttree.nwk -a align.fa -o output_dir
Infer gene tree topology using CAT in RAxML. Three runs were performed, one with the FastTree tree as the starting tree; the other two with random seeds:
raxmlHPC -m PROTCATLG -F -f D -D -s align.fa -t fasttree.nwk
raxmlHPC -m PROTCATLG -F -f D -D -s align.fa -p 12345
raxmlHPC -m PROTCATLG -F -f D -D -s align.fa -p 23456
- The amino acid substitution matrix
LG
can be replaced with other gene-specific matrices.
Optimize branch lengths and compute likelihood score using Gamma in RAxML:
raxmlHPC -m PROTGAMMALG -f e -s align.fa -t cat.nwk
If one or more of the three runs fail due to computational limitation, use IQ-TREE instead for all three:
iqtree -m LG+G4 -s align.fa -te cat.nwk
Among the three trees, keep one with the highest likelihood score.
4. Species tree by concatenation
4.1. Site sampling
The 381 gene alignments were concatenated into a supermatrix. To reduce computational cost, and to improve alignment reliability, we performed two types of site sampling:
- Select up to k most conserved sites per gene, using the Trident scoring function implemented in PhyloPhlAn.
import phylophlan as ppa
ppa.subsample('path/to/input/folder',
'path/to/output/folder',
ppa.onehundred,
ppa.trident,
'substitution_matrices/pfasum60.pkl',
unknown_fraction=0.3)
- Randomly select k sites per gene, from sites with less than 50% gaps.
import phylophlan as ppa
ppa.subsample('path/to/input/folder',
'path/to/output/folder',
ppa.onehundred,
ppa.random_score,
'substitution_matrices/pfasum60.pkl',
unknown_fraction=0.3)
In the current release, k = 100.
4.2. Tree building
The procedures are overall similar to those used for generating individual gene trees.
Generate a starting tree using FastTree:
FastTree -lg -gamma -seed 12345 concat.phy > fasttree.nwk
Infer topology using CAT in RAxML:
raxmlHPC -m PROTCATLG -F -f D -s concat.phy -t fasttree.nwk
raxmlHPC -m PROTCATLG -F -f D -s concat.phy -p 12345
raxmlHPC -m PROTCATLG -F -f D -s concat.phy -p 23456
Optimize branch lengths and compute likelihood score using Gamma in IQ-TREE:
iqtree -m LG+G4 -s concat.phy -te raxml.nwk
Keep the highest-score tree of the three.
4.3. Branch supports
Branch support values were provided by 100 rapid bootstraps using RAxML:
raxmlHPC -m PROTCATLG -s concat.phy -p 12345 -x 12345 -N 100
raxmlHPC -m PROTCATLG -f b -z xboot.nwk -t iqtree.nwk
5. Species tree by summary
We use ASTRAL, which infers the optimal species tree topolgy by summarizing multiple gene trees.
5.1. Tree building
java -jar astral.jar -i gene.trees -o astral.tre
5.2. Branch supports
The output file astral.tre
reports the following branch support values:
- EN, QC, f1, f2, f3, pp1, pp2, pp3, q1, q2, q3
In which three are cared:
- EN: **Effective number (en): number of gene trees that provided information.
- q1: Quartet score (qts): proportion of gene trees that support this branch.
- pp1: Local posterior probability (lpp): computed based on the quartet score.
5.3. Branch lengths
The branch lengths in astral.tre
are in units of coalescence. To obtain “conventional” branch lengths, i.e., number of mutations per site, we used the concatenated alignments:
iqtree -s align.fa -te tree.nwk -keep-ident -m LG+G4 -pre [] -nt 24
Please also see tree manipulation.