Subcommand: clean tree

Lucas Czech edited this page Jan 16, 2022 · 1 revision

Clean a tree in Newick format by removing parts that other parsers have difficulties with.

Usage: gappa prepare clean-tree [options]

Options

Input
--tree-file Required. TEXT:FILE
Tree file in Newick format.
Settings
--remove-inner-labels FLAG
Some Newick trees contain inner node labels, which can confuse some parsers. This option removes them.
--replace-invalid-chars FLAG
Replace invalid characters in node labels ( ,:;"()[]) by underscores. The Newick format requires node labels to be wrapped in double quotation marks if they contain these characters, but many parsers cannot handle this. For such cases, replacing the characters can help.
--remove-comments-and-nhx FLAG
The Newick format allows for comments in square brackets [], which are also often (mis-)used for ad-hoc and more established extensions such as the New Hampshire eXtended (NHX) format [&&NHX:key=value:...]. Many parsers cannot handle this; this option removes such annotations.
--remove-extra-numbers FLAG
The Rich/Rice Newick format extension allows bootstrap values and probabilities to be annotated per branch, by adding :[bootstrap]:[prob] fields after the branch length. Many parsers cannot handle this; this option removes such annotations.
--remove-jplace-tags FLAG
The Jplace file format for phylogenetic placements also uses a custom Newick extension, introducing curly brackets to annotate edge numbers in the tree, e.g. {1}. We are not aware of any other Newick extension that uses this style; nonetheless, with this option, all annotations in curly brackets are removed.
Output
--out-dir TEXT=.
Directory to write output files to.
--file-prefix TEXT
File prefix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
--file-suffix TEXT
File suffix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
Global Options
--allow-file-overwriting FLAG
Allow overwriting existing output files instead of aborting the command.
--verbose FLAG
Produce more verbose output.
--threads UINT
Number of threads to use for calculations.
--log-file TEXT
Write all output to a log file, in addition to standard output to the terminal.
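
To make the Settings options above more concrete, the following Python sketch mimics their effect on a toy Newick string with regular expressions. It is an illustration only, not how gappa implements the command: all function names are hypothetical, and the sketch assumes a tree without quoted labels, which a real Newick parser would have to handle.

```python
import re

def remove_comments_and_nhx(newick):
    # Drop everything in square brackets, e.g. [&&NHX:S=x] or [a comment].
    return re.sub(r"\[[^\]]*\]", "", newick)

def remove_jplace_tags(newick):
    # Drop edge numbers in curly brackets, e.g. {1}.
    return re.sub(r"\{[^}]*\}", "", newick)

def remove_extra_numbers(newick):
    # Keep the first ":value" field (the branch length), drop further ones.
    return re.sub(r"(:[^:,;()]*)(?::[^:,;()]*)+", r"\1", newick)

def remove_inner_labels(newick):
    # A label directly after ")" belongs to an inner node or the root.
    return re.sub(r"\)[^:,;()]+", ")", newick)

def replace_invalid_chars(label):
    # Replace the characters  ,:;"()[]  (and spaces) in a single label.
    return re.sub(r'[ ,:;"()\[\]]', "_", label)

def clean_newick(newick):
    # Order matters in this sketch: bracketed annotations may contain
    # colons, so they are removed before the extra ":value" fields.
    newick = remove_comments_and_nhx(newick)
    newick = remove_jplace_tags(newick)
    newick = remove_extra_numbers(newick)
    return remove_inner_labels(newick)

tree = "((A:0.1{0},B:0.2:95:0.9{1})inner[&&NHX:S=x]:0.3{2},C:0.4{3})root;"
print(clean_newick(tree))  # ((A:0.1,B:0.2):0.3,C:0.4);
print(replace_invalid_chars("Homo sapiens (human)"))  # Homo_sapiens__human_
```

Note that replace_invalid_chars works on an individual label here, because applying it to the whole tree string would also replace the structural parentheses, commas, and colons.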

Description

The command cleans a tree in Newick format (and some of its extensions) by removing parts that might lead some downstream parsers to fail.

The Newick file format for phylogenetic trees in its original standard only supports node names (taxa names) and branch lengths. Over the years, many ad-hoc and custom extensions have been suggested and used in practice to compensate for the format's lack of flexibility. This, however, led to many downstream parsers being unable to handle all of these dialects of the format; see

A Critical Review on the Use of Support Values in Tree Viewers and Bioinformatics Toolkits.
Czech L, Huerta-Cepas J, Stamatakis A.
Molecular Biology and Evolution, 34(6), 2017.
https://doi.org/10.1093/molbev/msx055

for some of the issues that might arise.

This command can be used to clean some of those difficult extensions/annotations, by simply removing them. It is meant as a cleaning tool for other software packages that cannot read a given Newick tree. When all options are activated, all types of extra data (that we know of) are removed, leading to a tree with just node names at the terminal (leaf) nodes, and branch lengths. Note that branch lengths might slightly change even if nothing is removed, due to numerical rounding.
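
As a sketch of the "all options activated" case, the following Python snippet assembles a call to the command with every cleaning flag from the Settings section. It assumes that a gappa executable is available on the PATH; the helper names and the example file names are hypothetical and only serve the illustration.

```python
import subprocess

# All cleaning flags documented in the Settings section.
CLEANING_FLAGS = [
    "--remove-inner-labels",
    "--replace-invalid-chars",
    "--remove-comments-and-nhx",
    "--remove-extra-numbers",
    "--remove-jplace-tags",
]

def build_clean_tree_command(tree_file, out_dir=".", gappa="gappa"):
    """Assemble the argument list without running anything."""
    return [
        gappa, "prepare", "clean-tree",
        "--tree-file", str(tree_file),
        *CLEANING_FLAGS,
        "--out-dir", str(out_dir),
    ]

def run_clean_tree(tree_file, out_dir="."):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_clean_tree_command(tree_file, out_dir), check=True)

# Print the command line for a hypothetical input file.
print(" ".join(build_clean_tree_command("my_tree.newick", "cleaned")))
```

Keeping the command construction separate from the call makes it possible to inspect or log the exact command line before running it.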

Citation

When using this method, please do not forget to cite

Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070
