Use new tax classes for taxonomic summarization #2443

Merged 51 commits on Feb 8, 2023
Changes from 48 commits
Commits
b1ed900
init new LineagePair and LineageInfo classes
bluegenes Jan 5, 2023
5aa5714
test new LineagePair,BaseLineageInfo,RankLineageInfo
bluegenes Jan 6, 2023
ea7a163
fix
bluegenes Jan 6, 2023
23e2480
add v4.4 columns that will be required
bluegenes Jan 6, 2023
331336b
Merge branch 'latest' into upd-lineage-utils
bluegenes Jan 6, 2023
f458dc1
make the newly added query_bp info in test1.gather.csv work with exis…
bluegenes Jan 6, 2023
592993d
rename test1.gather_ani.csv to test1.gather_old.csv to reflect its ne…
bluegenes Jan 6, 2023
06fb848
moar lineage tests
bluegenes Jan 6, 2023
ed055b6
test remaining codecov misses
bluegenes Jan 6, 2023
c15ff78
whoops, one last codecov miss
bluegenes Jan 6, 2023
766a8b9
add tax summarization classes; rename old NamedTuples to avoid breakage
bluegenes Jan 6, 2023
808e6be
finish renaming
bluegenes Jan 6, 2023
7c9fdb4
add tests for new classes
bluegenes Jan 6, 2023
8c2c546
upd
bluegenes Jan 7, 2023
00f9a64
more tests; move status checking into ClassificationResult
bluegenes Jan 9, 2023
cbb1e65
human summary dict tests
bluegenes Jan 9, 2023
d4c43fd
add value checks and tests for SGR,CR classes
bluegenes Jan 9, 2023
724bf74
test make_full_summary
bluegenes Jan 10, 2023
9dd9adf
use f_unique, not f_weighted to preserve current functionality
bluegenes Jan 10, 2023
c33c137
test no ranks
bluegenes Jan 10, 2023
15031ef
test make_kreport_results
bluegenes Jan 10, 2023
a810953
krona fns that use the new dataclasses
bluegenes Jan 10, 2023
6d1d525
init changes to tax main
bluegenes Jan 10, 2023
fd814f6
upd load_gather fns
bluegenes Jan 10, 2023
cf39469
init metagenome changes to use new framework
bluegenes Jan 10, 2023
f03c2c3
Merge branch 'latest' into upd-tax-summarization
bluegenes Jan 10, 2023
5f924d3
Merge branch 'upd-tax-summarization' into use-new-tax-classes
bluegenes Jan 10, 2023
826bc6b
fix lineagepair
bluegenes Jan 11, 2023
9a3f4b7
Merge branch 'upd-tax-summarization' into use-new-tax-classes
bluegenes Jan 11, 2023
e3a80c2
fix kreport num_bp; upd some err msg
bluegenes Jan 11, 2023
afc88db
add ksize, scaled to mimic v4.4+ output
bluegenes Jan 11, 2023
1a82fe4
upd krona writing
bluegenes Jan 11, 2023
79f7b47
use RankLineageInfo() instead of () for unclassified/empty tax
bluegenes Jan 11, 2023
7f689bb
display lineage within aggregate; all metagenome tests working
bluegenes Jan 11, 2023
d50e9b1
init changes for genome, annotate
bluegenes Jan 11, 2023
256d400
fix annotate header; make sure load_gather doesnt ask for query_name …
bluegenes Jan 11, 2023
66630e2
enable lineage_csv output from genome
bluegenes Jan 11, 2023
6d43fdf
now fail if gather <4.4
bluegenes Jan 11, 2023
a2d80d0
bp is int, not float
bluegenes Jan 11, 2023
0f97276
upd write classif
bluegenes Jan 11, 2023
c459d11
add ksize and scaled to mimic v4.4
bluegenes Jan 11, 2023
bf2e154
use fraction, not f_weighted to match prior functionality
bluegenes Jan 11, 2023
7183f76
test name_at_rank, as_lineage_dict
bluegenes Jan 11, 2023
5309872
test metagenome single-query output format restrictions
bluegenes Jan 13, 2023
4e9ce04
test single query output formats; multi q krona
bluegenes Jan 13, 2023
a8ee8dc
cleaner
bluegenes Jan 13, 2023
60ded63
Merge branch 'latest' into use-new-tax-classes
bluegenes Jan 16, 2023
5d51cfd
Merge branch 'latest' into use-new-tax-classes
bluegenes Jan 31, 2023
19374af
cleanup taxonomy code after refactor (#2446)
bluegenes Feb 7, 2023
8902211
Merge branch 'latest' into use-new-tax-classes
bluegenes Feb 7, 2023
1a1bb1a
apply suggestions from code review
bluegenes Feb 7, 2023
229 changes: 91 additions & 138 deletions src/sourmash/tax/__main__.py

Large diffs are not rendered by default.

370 changes: 290 additions & 80 deletions src/sourmash/tax/tax_utils.py

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions tests/test-data/tax/47+63_x_gtdb-rs202.gather.csv
@@ -1,3 +1,3 @@
-intersect_bp,f_orig_query,f_match,f_unique_to_query,f_unique_weighted,average_abund,median_abund,std_abund,name,filename,md5,f_match_orig,unique_intersect_bp,gather_result_rank,remaining_bp,query_filename,query_name,query_md5,query_bp
-5238000,0.6642150646715699,1.0,0.6642150646715699,0.6642150646715699,,,,"GCF_000021665.1 Shewanella baltica OS223 strain=OS223, ASM2166v1",/group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.sbt.zip,38729c6374925585db28916b82a6f513,1.0,5238000,0,2648000,,47+63,491c0a81,7886000
-5177000,0.6564798376870403,0.5114931427467645,0.3357849353284301,0.3357849353284301,,,,"GCF_000017325.1 Shewanella baltica OS185 strain=OS185, ASM1732v1",/group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.sbt.zip,09a08691ce52952152f0e866a59f6261,1.0,2648000,1,0,,47+63,491c0a81,7886000
+intersect_bp,f_orig_query,f_match,f_unique_to_query,f_unique_weighted,average_abund,median_abund,std_abund,name,filename,md5,f_match_orig,unique_intersect_bp,gather_result_rank,remaining_bp,query_filename,query_name,query_md5,query_bp,ksize,scaled
+5238000,0.6642150646715699,1.0,0.6642150646715699,0.6642150646715699,,,,"GCF_000021665.1 Shewanella baltica OS223 strain=OS223, ASM2166v1",/group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.sbt.zip,38729c6374925585db28916b82a6f513,1.0,5238000,0,2648000,,47+63,491c0a81,7886000,31,1000
+5177000,0.6564798376870403,0.5114931427467645,0.3357849353284301,0.3357849353284301,,,,"GCF_000017325.1 Shewanella baltica OS185 strain=OS185, ASM1732v1",/group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.sbt.zip,09a08691ce52952152f0e866a59f6261,1.0,2648000,1,0,,47+63,491c0a81,7886000,31,1000
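The change to this file appends `ksize` and `scaled` columns (31 and 1000) to every row, in line with the "add ksize, scaled to mimic v4.4+ output" commits. A minimal sketch of how an older gather CSV could be upgraded the same way is below. The helper name `add_ksize_scaled` and its defaults are hypothetical and are not part of sourmash; the sketch only assumes the CSV layout shown in this diff.

```python
import csv
import io


def add_ksize_scaled(csv_text, ksize=31, scaled=1000):
    """Return csv_text with 'ksize' and 'scaled' columns appended to every row.

    Hypothetical helper for illustration only; columns that already exist
    are left untouched.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or [])
    for col in ("ksize", "scaled"):
        if col not in fieldnames:
            fieldnames.append(col)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Only fill in values for rows that lack the columns.
        row.setdefault("ksize", ksize)
        row.setdefault("scaled", scaled)
        writer.writerow(row)
    return out.getvalue()
```

Because existing columns are preserved, running the helper on an already-upgraded file leaves it unchanged.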
2 changes: 1 addition & 1 deletion tests/test-data/tax/test1.gather.csv
@@ -1,5 +1,5 @@
intersect_bp,f_orig_query,f_match,f_unique_to_query,f_unique_weighted,average_abund,median_abund,std_abund,name,filename,md5,f_match_orig,unique_intersect_bp,gather_result_rank,remaining_bp,query_name,query_md5,query_filename,query_bp,ksize,scaled,query_n_hashes
442000,0.08815317112086159,0.08438335242458954,0.08815317112086159,0.05815279361459521,1.6153846153846154,1.0,1.1059438185997785,"GCF_001881345.1 Escherichia coli strain=SF-596, ASM188134v1",/group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.sbt.zip,683df1ec13872b4b98d59e98b355b52c,0.042779713511420826,442000,0,4572000,test1,md5,test1.sig,5014000,31,1000,2507
-390000,0.07778220981252493,0.10416666666666667,0.07778220981252493,0.050496823586903404,1.5897435897435896,1.0,0.8804995294906566,"GCF_009494285.1 Prevotella copri strain=iAK1218, ASM949428v1",/group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.sbt.zip,1266c86141e3a5603da61f57dd863ed0,0.052236806857755155,390000,1,4182000,test1,md5,test1.sig,50140000,31,1000,2507
+390000,0.07778220981252493,0.10416666666666667,0.07778220981252493,0.050496823586903404,1.5897435897435896,1.0,0.8804995294906566,"GCF_009494285.1 Prevotella copri strain=iAK1218, ASM949428v1",/group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.sbt.zip,1266c86141e3a5603da61f57dd863ed0,0.052236806857755155,390000,1,4182000,test1,md5,test1.sig,5014000,31,1000,2507
138000,0.027522935779816515,0.024722321748477247,0.027522935779816515,0.015637726014008795,1.391304347826087,1.0,0.5702120455914782,"GCF_013368705.1 Bacteroides vulgatus strain=B33, ASM1336870v1",/group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.sbt.zip,7d5f4ba1d01c8c3f7a520d19faded7cb,0.012648945921173235,138000,2,4044000,test1,md5,test1.sig,5014000,31,1000,2507
338000,0.06741124850418827,0.013789581205311542,0.010769844435580374,0.006515719172503665,1.4814814814814814,1.0,0.738886568268889,"GCF_003471795.1 Prevotella copri strain=AM16-54, ASM347179v1",/group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.sbt.zip,0ebd36ff45fc2810808789667f4aad84,0.04337782340862423,54000,3,3990000,test1,md5,test1.sig,5014000,31,1000,2507
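The single changed row in this file differs only in `query_bp`: the Prevotella copri iAK1218 row goes from 50140000 to 5014000, which brings it in line with the other rows for query `test1`. A check along the following lines would flag that kind of mismatch in test data. This is a sketch under the assumption that every row for one `query_name` should carry the same `query_bp`; the function name `inconsistent_query_bp` is hypothetical and not taken from sourmash.

```python
import csv
import io
from collections import defaultdict


def inconsistent_query_bp(csv_text):
    """Map each query_name to its query_bp values when rows disagree.

    Hypothetical helper for illustration only. Values are parsed as int,
    since base-pair counts are whole numbers.
    """
    seen = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        seen[row["query_name"]].add(int(row["query_bp"]))
    # Keep only queries that report more than one distinct query_bp.
    return {name: values for name, values in seen.items() if len(values) > 1}
```

An empty result means every query reports a single `query_bp` value across its rows.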
14 changes: 7 additions & 7 deletions tests/test-data/tax/test1_x_gtdbrs202_genbank_euks.gather.csv
@@ -1,7 +1,7 @@
-intersect_bp,f_orig_query,f_match,f_unique_to_query,f_unique_weighted,average_abund,median_abund,std_abund,name,filename,md5,f_match_orig,unique_intersect_bp,gather_result_rank,remaining_bp,query_filename,query_name,query_md5,query_bp
-442000,0.08815317112086159,0.08438335242458954,0.08815317112086159,0.05815279361459521,1.6153846153846154,1.0,1.1059438185997785,"GCF_001881345.1 Escherichia coli strain=SF-596, ASM188134v1",/group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.sbt.zip,683df1ec13872b4b98d59e98b355b52c,0.042779713511420826,442000,0,4572000,outputs/abundtrim/HSMA33MX.abundtrim.fq.gz,multtest,9687eeed,5014000
-390000,0.07778220981252493,0.10416666666666667,0.07778220981252493,0.050496823586903404,1.5897435897435896,1.0,0.8804995294906566,"GCF_009494285.1 Prevotella copri strain=iAK1218, ASM949428v1",/group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.sbt.zip,1266c86141e3a5603da61f57dd863ed0,0.052236806857755155,390000,1,4182000,outputs/abundtrim/HSMA33MX.abundtrim.fq.gz,multtest,9687eeed,5014000
-206000,0.041084962106102914,0.007403148134837921,0.041084962106102914,0.2215344518651246,13.20388349514563,3.0,69.69466823965065,"GCA_002754635.1 Plasmodium vivax strain=CMB-1, CMB-1_v2",/home/irber/sourmash_databases/outputs/sbt/genbank-protozoa-x1e6-k31.sbt.zip,8125e7913e0d0b88deb63c9ad28f827c,0.0037419167332703625,206000,2,3976000,outputs/abundtrim/HSMA33MX.abundtrim.fq.gz,multtest,9687eeed,5014000
-138000,0.027522935779816515,0.024722321748477247,0.027522935779816515,0.015637726014008795,1.391304347826087,1.0,0.5702120455914782,"GCF_013368705.1 Bacteroides vulgatus strain=B33, ASM1336870v1",/group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.sbt.zip,7d5f4ba1d01c8c3f7a520d19faded7cb,0.012648945921173235,138000,3,3838000,outputs/abundtrim/HSMA33MX.abundtrim.fq.gz,multtest,9687eeed,5014000
-338000,0.06741124850418827,0.013789581205311542,0.010769844435580374,0.006515719172503665,1.4814814814814814,1.0,0.738886568268889,"GCF_003471795.1 Prevotella copri strain=AM16-54, ASM347179v1",/group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.sbt.zip,0ebd36ff45fc2810808789667f4aad84,0.04337782340862423,54000,4,3784000,outputs/abundtrim/HSMA33MX.abundtrim.fq.gz,multtest,9687eeed,5014000
-110000,0.021938571998404467,0.000842978957948319,0.010370961308336658,0.023293696041700604,5.5,2.5,7.417494911978758,"GCA_000256725.2 Toxoplasma gondii TgCatPRC2 strain=TgCatPRC2, TGCATPRC2 v2",/home/irber/sourmash_databases/outputs/sbt/genbank-protozoa-x1e6-k31.sbt.zip,2a3b1804cf5ea5fe75dde3e153294548,0.0008909768346023004,52000,5,3732000,outputs/abundtrim/HSMA33MX.abundtrim.fq.gz,multtest,9687eeed,5014000
+intersect_bp,f_orig_query,f_match,f_unique_to_query,f_unique_weighted,average_abund,median_abund,std_abund,name,filename,md5,f_match_orig,unique_intersect_bp,gather_result_rank,remaining_bp,query_filename,query_name,query_md5,query_bp,ksize,scaled
+442000,0.08815317112086159,0.08438335242458954,0.08815317112086159,0.05815279361459521,1.6153846153846154,1.0,1.1059438185997785,"GCF_001881345.1 Escherichia coli strain=SF-596, ASM188134v1",/group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.sbt.zip,683df1ec13872b4b98d59e98b355b52c,0.042779713511420826,442000,0,4572000,outputs/abundtrim/HSMA33MX.abundtrim.fq.gz,multtest,9687eeed,5014000,31,1000
+390000,0.07778220981252493,0.10416666666666667,0.07778220981252493,0.050496823586903404,1.5897435897435896,1.0,0.8804995294906566,"GCF_009494285.1 Prevotella copri strain=iAK1218, ASM949428v1",/group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.sbt.zip,1266c86141e3a5603da61f57dd863ed0,0.052236806857755155,390000,1,4182000,outputs/abundtrim/HSMA33MX.abundtrim.fq.gz,multtest,9687eeed,5014000,31,1000
+206000,0.041084962106102914,0.007403148134837921,0.041084962106102914,0.2215344518651246,13.20388349514563,3.0,69.69466823965065,"GCA_002754635.1 Plasmodium vivax strain=CMB-1, CMB-1_v2",/home/irber/sourmash_databases/outputs/sbt/genbank-protozoa-x1e6-k31.sbt.zip,8125e7913e0d0b88deb63c9ad28f827c,0.0037419167332703625,206000,2,3976000,outputs/abundtrim/HSMA33MX.abundtrim.fq.gz,multtest,9687eeed,5014000,31,1000
+138000,0.027522935779816515,0.024722321748477247,0.027522935779816515,0.015637726014008795,1.391304347826087,1.0,0.5702120455914782,"GCF_013368705.1 Bacteroides vulgatus strain=B33, ASM1336870v1",/group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.sbt.zip,7d5f4ba1d01c8c3f7a520d19faded7cb,0.012648945921173235,138000,3,3838000,outputs/abundtrim/HSMA33MX.abundtrim.fq.gz,multtest,9687eeed,5014000,31,1000
+338000,0.06741124850418827,0.013789581205311542,0.010769844435580374,0.006515719172503665,1.4814814814814814,1.0,0.738886568268889,"GCF_003471795.1 Prevotella copri strain=AM16-54, ASM347179v1",/group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.sbt.zip,0ebd36ff45fc2810808789667f4aad84,0.04337782340862423,54000,4,3784000,outputs/abundtrim/HSMA33MX.abundtrim.fq.gz,multtest,9687eeed,5014000,31,1000
+110000,0.021938571998404467,0.000842978957948319,0.010370961308336658,0.023293696041700604,5.5,2.5,7.417494911978758,"GCA_000256725.2 Toxoplasma gondii TgCatPRC2 strain=TgCatPRC2, TGCATPRC2 v2",/home/irber/sourmash_databases/outputs/sbt/genbank-protozoa-x1e6-k31.sbt.zip,2a3b1804cf5ea5fe75dde3e153294548,0.0008909768346023004,52000,5,3732000,outputs/abundtrim/HSMA33MX.abundtrim.fq.gz,multtest,9687eeed,5014000,31,1000
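The commit list includes "now fail if gather <4.4", and the test-data changes above add `ksize` and `scaled` to older gather files. One way such a requirement could be enforced is a header check before any rows are read. The sketch below is illustrative only: the `ASSUMED_REQUIRED` tuple is a guess based on the columns touched in this diff, and the function names are hypothetical. The actual required-column list lives in `tax_utils.py`, whose diff is not rendered here.

```python
import csv
import io

# Assumed for illustration; the real required-column list is not shown in this diff.
ASSUMED_REQUIRED = ("query_name", "query_bp", "ksize", "scaled")


def missing_gather_columns(csv_text, required=ASSUMED_REQUIRED):
    """Return the required column names absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return [col for col in required if col not in header]


def require_gather_columns(csv_text, required=ASSUMED_REQUIRED):
    """Raise ValueError if any required column is missing from the header."""
    missing = missing_gather_columns(csv_text, required)
    if missing:
        raise ValueError(
            "gather CSV is missing required columns: " + ", ".join(missing)
        )
```

Checking the header up front gives a single clear error for files from older gather versions, rather than a `KeyError` partway through row processing.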