cleanup taxonomy code after refactor (#2446)
## Taxonomy Refactor Overview

To allow use of NCBI taxids (motivation: CAMI benchmarking) and
alternate hierarchical taxonomic ranks (motivation: LINS), I ended up
refactoring the taxonomy code in a four-PR series. Taxonomic
summarization results should not change. Minor caveat: I was previously
obtaining `query_bp` in a hacky manner to support results from gather
<4.4. The class methods are more robust, and I'd like to stop supporting
gather <4.4 results. To allow this, I had to add the `query_bp`,
`ksize`, and `scaled` columns to some test results to keep the tests
passing.

1. #2437 modifies
`LineagePair` from a two-item `collections.namedtuple` to a three-item
`typing.NamedTuple` containing an additional field, `taxid`, for storing
NCBI taxid information. It also introduces classes (`BaseLineageInfo`,
`RankLineageInfo`), which move lineage manipulation (from
`lca_utils.py`) to class methods in order to support robust
summarization across compatible lineages (lineages with the same
hierarchical ranks). These classes are frozen so that they can be used
as dictionary keys.

2. #2439 introduces classes
that facilitate reading, summarization, and writing of gather results.
First, it updates three prior `collections.namedtuple`s to `dataclasses`
used for storing information about the gather query (`QueryInfo`),
summarized gather information for metagenome queries
(`SummarizedGatherResult`) and classification information for genome
queries (`ClassificationResult`). It introduces three new classes for
reading and manipulating gather results. `GatherRow` is used for
reading each row from a gather file and automatically checking for
required columns. `TaxResult` is used for storing a single row from a
gather file, optionally (and ideally) with taxonomic information, stored
as a `LineageInfo` class from PR 1. `QueryTaxResult` is used for storing
all `TaxResult`s associated with a single query. `QueryTaxResult` adds
methods to replicate the summarization previously done within
`summarize_gather_at` in `tax_utils.py` and the classification
thresholding in `genome` within `__main__.py`.

3. #2443 replaces the
actual taxonomic summarization code in `tax/__main__.py` with code that
uses the new classes. It also modifies the gather loading code to read
using `GatherRow`, `TaxResult`, and `QueryTaxResult`.

4. #2446 removes old,
unused functions that are rendered redundant by the new classes. Also
removes associated tests.
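
To illustrate the design described in item 1, here is a minimal sketch of a three-item `typing.NamedTuple` and a frozen class used as a dictionary key. Only `LineagePair` and its `taxid` field come from the description above; the `rank` and `name` field names, the field types, the `ExampleLineageInfo` class, and its `truncated_at` method are assumptions made for illustration and are not sourmash's actual API.

```python
from dataclasses import dataclass
from typing import NamedTuple, Optional, Tuple


class LineagePair(NamedTuple):
    """One rank of a lineage. `rank`/`name` and all types are assumed."""
    rank: str
    name: Optional[str] = None
    taxid: Optional[int] = None  # the additional field described in #2437


@dataclass(frozen=True)
class ExampleLineageInfo:
    """Hypothetical stand-in for a lineage class (not the real one)."""
    lineage: Tuple[LineagePair, ...]

    def truncated_at(self, rank: str) -> "ExampleLineageInfo":
        """Return a new object cut off after `rank`; frozen, so never mutate."""
        ranks = [pair.rank for pair in self.lineage]
        if rank not in ranks:
            raise ValueError(f"rank {rank!r} not in lineage")
        return ExampleLineageInfo(self.lineage[: ranks.index(rank) + 1])


bacteria = LineagePair("superkingdom", "Bacteria", 2)
proteo = LineagePair("phylum", "Proteobacteria", 1224)

lin_a = ExampleLineageInfo(
    (bacteria, proteo, LineagePair("class", "Gammaproteobacteria", 1236))
)
lin_b = ExampleLineageInfo(
    (bacteria, proteo, LineagePair("class", "Alphaproteobacteria"))
)

# Frozen dataclasses with eq=True are hashable, so truncated lineages can
# serve as dictionary keys when summing matched base pairs at one rank.
# The base-pair numbers are made up for the example.
bp_by_lineage = {}
for lin, bp in [(lin_a, 1000), (lin_b, 500)]:
    key = lin.truncated_at("phylum")
    bp_by_lineage[key] = bp_by_lineage.get(key, 0) + bp
```

Because both example lineages share the same superkingdom and phylum, they collapse to a single key at the phylum rank. A mutable class would not be safely hashable, which is presumably why the real classes are frozen.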

## Additional details for this PR (#2446) 

- Delete old functions that aren't used outside of taxonomic
summarization + associated tests
  - Including old `namedtuple`s: `QueryInf`, `SumGathInf`, `ClassInf`
- Make sure any old comments/documentation make it into new code
- Don't use unnecessary empty `()` for dataclasses
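
As a small illustration of the last bullet, the sketch below shows that `@dataclass` and `@dataclass()` produce equivalent classes, so the empty parentheses can be dropped. The class names and the `query_name` field are hypothetical; only `query_bp` is taken from the description above.

```python
from dataclasses import dataclass, fields


@dataclass  # preferred: no empty parentheses
class QueryInfoBare:
    query_name: str
    query_bp: int


@dataclass()  # equivalent behavior; the () adds nothing
class QueryInfoParens:
    query_name: str
    query_bp: int


@dataclass(frozen=True)  # parentheses are only needed when passing options
class QueryInfoFrozen:
    query_name: str
    query_bp: int


bare_fields = [f.name for f in fields(QueryInfoBare)]
parens_fields = [f.name for f in fields(QueryInfoParens)]
```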
bluegenes committed Feb 7, 2023
1 parent 5d51cfd commit 19374af
Showing 3 changed files with 172 additions and 1,195 deletions.
4 changes: 2 additions & 2 deletions src/sourmash/tax/__main__.py
@@ -11,10 +11,10 @@
import sourmash
from ..sourmash_args import FileOutputCSV, FileOutput
from sourmash.logging import set_quiet, error, notify, print_results
-from sourmash.lca.lca_utils import display_lineage, zip_lineage
+from sourmash.lca.lca_utils import zip_lineage

from . import tax_utils
-from .tax_utils import ClassInf, MultiLineageDB, GatherRow
+from .tax_utils import MultiLineageDB, GatherRow

usage='''
sourmash taxonomy <command> [<args>] - manipulate/work with taxonomy information.