Skip to content

Commit

Permalink
MRG: Add taxonomic utilities for LINs and enable tax metagenome (#2469
Browse files Browse the repository at this point in the history
)

## Add taxonomic utilities for LINs; enable and test `tax metagenome`

With taxonomy refactoring (#2437, #2439, #2443, #2446, #2466, #2467), we
are (mostly) no longer tied to named ranks. Here, I add a class for LIN
taxonomies and use it within `tax metagenome` to allow summarization up
LINs and reporting at specified `lingroups`.

With this PR, users can now use the flag `--lins` to read and use `lin`
taxonomies from the provided tax (`-t`, `--taxonomy`) file. If used,
`sourmash tax` will look for a `lin` column in the taxonomy file instead
of looking for `superkingdom`...`strain` columns. The `lin` column
should contain `;`-separated LINs, preferably with a standard number of
positions (e.g. all 20 positions in length or all 10 positions in
length).

For `tax metagenome`:

By default, `tax metagenome` will summarize up _all_ available ranks/LIN
positions. If a `lingroup` file is provided, we will also report a
subset of this summary: just the LIN prefixes that match groups in the
`lingroup` file. The `lingroup` file requires two columns in any order:
`name`, the name of the group, and `lin`, the lin prefix of the group.
The prefix will be used to select results from the full summary for
reporting. The `lingroup` format will build a file with the following
name: `{base}.lingroup.tsv`, where `{base}` is the name provided via the
`-o`,` --output-base` option.

## Demo / Tutorial

A draft tutorial is available
[here](https://sourmash--2469.org.readthedocs.build/en/2469/tutorial-lin-taxonomy.html).
Note that it does not contain the installation info for this branch (see
below). You can run the interactive version via binder
[here](https://mybinder.org/v2/gh/bluegenes/2023-demo-sourmash-LIN/HEAD?labpath=sourmash-lin-demo.ipynb)


## Testing

### Option A: Use the Demo Binder

You can test via the
[binder](https://mybinder.org/v2/gh/bluegenes/2023-demo-sourmash-LIN/HEAD?labpath=sourmash-lin-demo.ipynb).
You can add new cells or modify any existing cells, and even download
additional files for testing. The downside is that you'll have to make
sure to download and save your results, since the binder won't save them
for you.

### Option B: Alternatively, install on your own computer/cluster: 
Here is one way to test this code before it gets fully integrated into
sourmash:
- If you don't have conda, I'd recommend installing `mamba`,
[instructions
here](https://mamba.readthedocs.io/en/latest/installation.html) instead.
- if you do have `mamba`, replace the word `conda` with `mamba` in the
following commands.

Download an environment file that points to this branch:
```
curl -JLO https://raw.githubusercontent.com/bluegenes/2023-demo-sourmash-LIN/main/sourmashLIN.yml
```
Create a virtual environment using this file:
```
conda env create -f sourmashLIN.yml
```
Activate that environment:
```
conda activate smashLIN
```
make sure `--lins` is in the `--help` for `sourmash tax metagenome`:
```
sourmash tax metagenome --help
```



## Command to run

The command to run is this one: 

```
sourmash tax metagenome -g $gather_csv  -t $taxonomy_csv \
                        --lins --lingroup $lingroups_csv
```

## Types of files you'll need

1. sketches of query metagenome
2. sketches of reference genomes (database)
3. taxonomy file with LIN information (two columns required: `ident`,
`lin`)
4. lingroup information file (two columns required: `name`, `lin`) 


To exit the environment when you're done testing, use `conda deactivate`

> Reminder, if you have `mamba`, you can use it in place of `conda` in
the commands above.


example `lingroup` output format. Note that the `1;0`.. paths are always
grouped together, but may come before or after the `0;0` and `2;0`
groups.

```
name	lin	percent_containment	num_bp_contained
lg3	2;0;0	1.56	192000
lg1	0;0;0	5.82	714000
lg2	1;0;0	5.05	620000
lg3	1;0;1	0.65	80000
lg4	1;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0	0.65	80000
```

```
name	lin	percent_containment	num_bp_contained
lg2	1;0;0	5.05	620000
lg3	1;0;1	0.65	80000
lg4	1;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0	0.65	80000
lg1	0;0;0	5.82	714000
lg3	2;0;0	1.56	192000
```


## A few implementation details:

- In `tax_utils.py`, I add a `LINLineageInfo` class for using and
manipulated LIN taxonomies. It implements new methods to enable
specifically reading in `LIN` taxonomies into the class, but otherwise
uses the taxonomic utilities available in `BaseLineageInfo`, e.g.
taxonomic summarization up ranks, assessing whether two taxonomies are a
match at a given rank.
- In `tax_utils.py`, I add functionality for reading `lingroup`
information and reporting taxonomic summarization specifically at these
ranks.

Changes and Additions:
- [x] Add `LINLineageInfo` for working with `LIN` taxonomies
- [x] Add method for reading `LIN`s into `LineageDB`
- [x] Add methods for reading `LINgroups` and summarizing to these
- [x] Add LineageTree that can use `LineageInfo` to perform
`build_tree`, `find_lca` functions (originally in `lca_utils.py`) and
produce an ordered list of lineage paths
- [x] Add code + tests to use `LIN`s taxonomy in:
  - [x] tax metagenome
  - [x] tax annotate
  - [x] tax summarize
 
The following require additional changes and will be punted to an
issue/separate PR (see #2499):
  - tax genome
  - tax prepare
  - tax grep

---------

Co-authored-by: C. Titus Brown <titus@idyll.org>
  • Loading branch information
bluegenes and ctb committed Apr 5, 2023
1 parent 43ec4ec commit 827b897
Show file tree
Hide file tree
Showing 15 changed files with 2,021 additions and 139 deletions.
36 changes: 33 additions & 3 deletions doc/command-line.md
Original file line number Diff line number Diff line change
Expand Up @@ -91,6 +91,7 @@ information; these are grouped under the `sourmash tax` and
* `tax metagenome` - summarize metagenome gather results at each taxonomic rank.
* `tax genome` - summarize single-genome gather results and report most likely classification.
* `tax annotate` - annotate gather results with lineage information (no summarization or classification).
* `tax prepare` - prepare and/or combine taxonomy files.
* `tax grep` - subset taxonomies and create picklists based on taxonomy string matches.
* `tax summarize` - print summary information (counts of lineages) for a taxonomy lineages file or database.

Expand Down Expand Up @@ -491,7 +492,8 @@ The sourmash `tax` or `taxonomy` commands integrate taxonomic
`gather` command (we cannot combine separate `gather` runs for the
same query). For supported databases (e.g. GTDB, NCBI), we provide
taxonomy csv files, but they can also be generated for user-generated
databases. For more information, see [databases](databases.md).
databases. As of v4.8, some sourmash taxonomy commands can also use `LIN`
lineage information. For more information, see [databases](databases.md).

`tax` commands rely upon the fact that `gather` provides both the total
fraction of the query matched to each database matched, as well as a
Expand Down Expand Up @@ -530,8 +532,13 @@ sourmash tax metagenome
--taxonomy gtdb-rs202.taxonomy.v2.csv
```

There are three possible output formats, `csv_summary`, `lineage_summary`, and
`krona`.
The possible output formats are:
- `human`
- `csv_summary`
- `lineage_summary`
- `krona`
- `kreport`
- `lingroup_report`

#### `csv_summary` output format

Expand Down Expand Up @@ -707,6 +714,29 @@ example sourmash `{output-name}.kreport.txt`:
```


#### `lingroup` output format

When using LIN taxonomic information, you can optionally also provide a `lingroup` file with two required columns: `name` and `lin`. If provided, we will produce a file, `{base}.lingroups.tsv`, where `{base}` is the name provided via the `-o`,` --output-base` option. This output will select information from the full summary that match the LIN prefixes provided as groups.

This output format consists of four columns:
- `name`, `lin` columns are taken directly from the `--lingroup` file
- `percent_containment`, the total percent of the dataset contained in this lingroup and all descendents
- `num_bp_contained`, the estimated number of base pairs contained in this lingroup and all descendents.

Similar to `kreport` above, we use the wording "contained" rather than "assigned," because `sourmash` assigns matches at the genome level, and the `tax` functions summarize this information.

example output:
```
name lin percent_containment num_bp_contained
lg1 0;0;0 5.82 714000
lg2 1;0;0 5.05 620000
lg3 2;0;0 1.56 192000
lg3 1;0;1 0.65 80000
lg4 1;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0 0.65 80000
```

Related lingroup subpaths will be grouped in output, but exact ordering may change between runs.

### `sourmash tax genome` - classify a genome using `gather` results

`sourmash tax genome` reports likely classification for each query,
Expand Down
7 changes: 7 additions & 0 deletions doc/databases.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,13 @@ You can read more about the different database and index types [here](https://so

Note that the SBT and LCA databases can be used with sourmash v3.5 and later, while Zipfile collections can only be used with sourmash v4.1 and up.

## Taxonomic Information (for non-LCA databases)

For each prepared database, we have also made taxonomic information available linking each genome with its assigned lineage (`GTDB` or `NCBI` as appropriate).
For private databases, users can create their own `taxonomy` files: the critical columns are `ident`, containing the genome accession (e.g. `GCA_1234567.1`) and
a column for each taxonomic rank, `superkingdom` to `species`. If a `strain` column is provided, it will also be used.
As of v4.8, we can also use LIN taxonomic information in tax commands that accept the `--lins` flag. If used, `sourmash tax` commands will require a `lin` column in the taxonomy file which should contain `;`-separated LINs, preferably with a standard number of positions (e.g. all 20 positions in length or all 10 positions in length). Some taxonomy commands also accept a `lingroups` file, which is a two-column file (`name`, `lin`) describing the name and LIN prefix of LINgroups to be used for taxonomic summarization.

## Downloading and using the databases

All databases below can be downloaded via the command line with `curl -JLO <url>`, where `<url>` is the URL below. This will download an appropriately named file; you can name it yourself by specify `'-o <output>` to specify the local filename.
Expand Down
Loading

0 comments on commit 827b897

Please sign in to comment.