
Subcommand: filter

Lucas Czech edited this page Aug 1, 2022 · 4 revisions

Filter jplace files according to some criteria, that is, remove all queries and/or placement locations that do not pass the provided filter(s).

Usage: gappa edit filter [options]

Options

Input
--jplace-path Required. TEXT:PATH(existing)=[] ...
List of jplace files or directories to process. For directories, only files with the extension .jplace[.gz] are processed.
Placement Filters
--normalize-before FLAG
Before filtering placements, normalize the initial placement masses (likelihood weight ratios) by proportionally scaling them so that they sum to one per pquery.
--min-accumulated-mass FLOAT:FLOAT in [0 - 1]=0
Only keep the most likely placements per query so that their accumulated mass is above the given minimum value.
--min-mass-threshold FLOAT:FLOAT in [0 - 1]=0
Only keep those placements per query whose mass is above the given minimum threshold.
--max-n-placements UINT=0
Only keep the n most likely placements per query.
--min-pendant-len FLOAT=0
Only keep placements with at least the given pendant length.
--max-pendant-len FLOAT=0
Only keep placements with at most the given pendant length.
--no-remove-empty FLAG
After filtering placements, there might be pqueries that do not have any placement locations remaining. By default, the whole pquery is removed in this case, as it is useless. However, if this flag is set, they are kept as empty pqueries with just their name.
--normalize-after FLAG
After filtering placements, normalize the remaining placement masses (likelihood weight ratios) by proportionally scaling them so that they sum to one per pquery.
Name Filters
--keep-names TEXT
Keep queries whose name matches the given names, which can be provided either as a regular expression (regex), or as a file with one name per line. Remove all others.
--remove-names TEXT
Remove queries whose name matches the given names, which can be provided either as a regular expression (regex), or as a file with one name per line. Keep all others.
Output
--out-dir TEXT=.
Directory to write output files to.
--file-prefix TEXT
File prefix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
--file-suffix TEXT
File suffix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
--compress FLAG
If set, compress the output files using gzip. Output file extensions are automatically extended by .gz.
Global Options
--allow-file-overwriting FLAG
Allow existing output files to be overwritten instead of aborting the command.
--verbose FLAG
Produce more verbose output.
--threads UINT
Number of threads to use for calculations.
--log-file TEXT
Write all output to a log file, in addition to standard output to the terminal.

Description

The command offers filtering for two aspects of jplace files:

  1. The placement locations, that is, individual placements within a pquery.
  2. The pquery names.

See below for details on each. All filtering options can also be combined, and are executed one after another in the order listed above; the order in which they are provided on the command line is not taken into account.

Note that when providing multiple jplace files as input (or a directory containing multiple jplace files), the input files are merged prior to filtering. If filtering should instead be applied per file, call this command separately for each file.
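As a rough illustration of the per-file approach, the following Python sketch builds one `gappa edit filter` invocation per input file, using only options from the list above. The helper name `build_filter_commands`, the example file names, the output directory `filtered`, and the choice to derive `--file-prefix` from the input file name are assumptions made for this example, not something gappa prescribes.

```python
from pathlib import Path


def build_filter_commands(jplace_files, out_dir="filtered", extra_options=()):
    """Return one `gappa edit filter` argument list per input file.

    Hypothetical helper: it only assembles the command lines, it does not run them.
    """
    commands = []
    for path in jplace_files:
        path = Path(path)
        # Strip ".gz" and then ".jplace" to get a per-file prefix
        # (an assumed naming scheme, chosen so that outputs do not collide).
        stem = path.name
        for ext in (".gz", ".jplace"):
            if stem.endswith(ext):
                stem = stem[: -len(ext)]
        commands.append([
            "gappa", "edit", "filter",
            "--jplace-path", str(path),
            "--out-dir", str(out_dir),
            "--file-prefix", stem + "_",
            *extra_options,
        ])
    return commands


commands = build_filter_commands(
    ["sample_a.jplace", "sample_b.jplace.gz"],  # example file names
    extra_options=["--min-mass-threshold", "0.01", "--normalize-after"],
)
for cmd in commands:
    print(" ".join(cmd))
    # To actually run a command, one could use: subprocess.run(cmd, check=True)
```

Because each call receives exactly one input file, no merging takes place, and the distinct prefixes keep the output files apart.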

1. Filtering by placement location

Each pquery (i.e., a placed sequence) can contain multiple placement locations, that is, placements on multiple branches of the reference tree. Each of these locations is typically annotated with a Likelihood Weight Ratio (LWR), which can be interpreted as a probability that the placement on this branch is correct. The placement locations can be filtered by their LWR, using several filtering options.
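To make the LWR-based filters more concrete, here is a minimal Python sketch of the three mass-based options, applied to the LWRs of a single pquery. It is not gappa's implementation: the function names are hypothetical, "above" is interpreted as greater than or equal, and the default value of 0 is treated as "filter disabled". All three are assumptions of this example.

```python
def filter_by_min_mass(lwrs, threshold):
    """Sketch of --min-mass-threshold: keep placements whose LWR reaches the threshold."""
    # Assumption: a placement exactly at the threshold is kept.
    return [w for w in lwrs if w >= threshold]


def filter_by_max_n(lwrs, n):
    """Sketch of --max-n-placements: keep the n most likely placements."""
    if n == 0:
        # Assumption: the default of 0 means "no limit".
        return list(lwrs)
    return sorted(lwrs, reverse=True)[:n]


def filter_by_accumulated_mass(lwrs, min_mass):
    """Sketch of --min-accumulated-mass: keep the most likely placements
    until their summed LWR reaches min_mass."""
    if min_mass <= 0:
        # Assumption: the default of 0 disables the filter.
        return list(lwrs)
    kept, total = [], 0.0
    for w in sorted(lwrs, reverse=True):
        if total >= min_mass:
            break
        kept.append(w)
        total += w
    return kept


# LWRs of one hypothetical pquery with four placement locations.
lwrs = [0.25, 0.6, 0.05, 0.1]

print(filter_by_min_mass(lwrs, 0.2))          # [0.25, 0.6]
print(filter_by_max_n(lwrs, 2))               # [0.6, 0.25]
print(filter_by_accumulated_mass(lwrs, 0.8))  # [0.6, 0.25]
```

In this example all three filters happen to keep the same two placements, but in general they differ: the threshold looks at each placement in isolation, whereas the other two depend on the ranking of the placements within the pquery.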

In theory, for a given pquery, the sum of all LWRs across all branches is 1. However, to save storage, not all placement locations might be stored in a jplace file, in which case the sum is lower than 1. In order to make the remaining locations sum to 1 again, we offer settings for normalizing (proportionally re-scaling) the LWRs:

Normalization of LWRs.

This can be applied before and/or after the filtering. Applying the normalization before filtering can of course change the effect of the filtering thresholds, as the LWRs themselves change. Applying it after filtering is a simple way to ensure unit probability for each pquery; this, however, hides the fact that the remaining placement locations no longer represent the full probability mass.
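The interaction between normalization and thresholds can be seen in a small worked example. The Python sketch below is an illustration only: the `normalize` helper, the example LWRs, and the threshold of 0.15 are assumptions chosen for this example, and the threshold comparison uses greater than or equal.

```python
def normalize(lwrs):
    """Proportionally rescale LWRs so that they sum to one (per pquery)."""
    total = sum(lwrs)
    if total == 0:
        # Nothing to rescale; return the values unchanged.
        return list(lwrs)
    return [w / total for w in lwrs]


# Stored LWRs of one hypothetical pquery. They sum to 0.54 only,
# because the remaining locations were not written to the jplace file.
stored = [0.4, 0.1, 0.04]
threshold = 0.15

# Without normalizing first, only one placement passes the threshold.
raw_kept = [w for w in stored if w >= threshold]

# Normalizing first scales 0.1 up to about 0.185, so it now passes as well.
norm_kept = [w for w in normalize(stored) if w >= threshold]

print(len(raw_kept), len(norm_kept))

# Normalizing after filtering restores a sum of one, which hides
# that the kept placement originally carried a mass of 0.4 only.
print(normalize(raw_kept))
```

The same threshold therefore keeps a different number of placements depending on whether the masses were normalized beforehand.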

2. Filtering by pquery names

Each sequence that is placed on the reference tree is stored in the jplace file using a name, typically the name from the input fasta file used with the placement program. This name can also be used for filtering, either as a list of names to keep or remove, or as a regular expression to match all names to keep or remove.
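The two ways of specifying names can be sketched in Python as follows. This is a hedged illustration, not gappa's code: the helper names are hypothetical, the rule "if the argument is an existing file, read names from it, otherwise treat it as a regex" is an assumption, and so is the requirement that the regex match the whole name. Python's regex dialect may also differ from the one gappa uses.

```python
import re
from pathlib import Path


def load_name_filter(spec):
    """Return a predicate (name -> bool) built from a regex or a file of names."""
    path = Path(spec)
    if path.is_file():
        # File with one name per line; blank lines are ignored.
        names = {line.strip() for line in path.read_text().splitlines() if line.strip()}
        return lambda name: name in names
    pattern = re.compile(spec)
    # Assumption: the regex has to match the whole name, not just a part of it.
    return lambda name: pattern.fullmatch(name) is not None


def filter_names(names, keep=None, remove=None):
    """Sketch of --keep-names and --remove-names applied to a list of pquery names."""
    result = list(names)
    if keep is not None:
        matches = load_name_filter(keep)
        result = [n for n in result if matches(n)]
    if remove is not None:
        matches = load_name_filter(remove)
        result = [n for n in result if not matches(n)]
    return result


# Hypothetical pquery names.
names = ["sample_A_001", "sample_A_002", "sample_B_001"]

print(filter_names(names, keep=r"sample_A_.*"))
print(filter_names(names, remove=r".*_001"))
```

Keeping by `sample_A_.*` retains the two `sample_A` names, while removing by `.*_001` leaves only `sample_A_002`.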

Citation

When using this method, please do not forget to cite

Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070
