Skip to content

Converts Phylosift KRONA plot outputs for several bins into an easily accessible table with majority taxa calls

License

Notifications You must be signed in to change notification settings

seanmcallister/Phylosift_to_table

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 

Repository files navigation

Phylosift to table

The purpose of this program is to quickly collect data from Phylosift (GitHub link) runs on multiple bins and collate into a single table.

The expectation of the input is that you have a folder containing all the Phylosift output folders for each bin. The program looks for the Krona html file and pulls the results from this file. I haven't found the other outfiles from Phylosift to be quite as informative as the Krona file, which is why I chose it.

Dependencies

The only dependencies are a couple of Perl modules I regularly use. I think these are actually available in the Perl download, so not sure you really need to install anything. The modules are: strict & Getopt::Std.

Installation

Download the repository from GitHub: Phylosift_to_table

Usage

After installation, pull up the help information:

perl phylosift_krona_to_table.pl -h

This will give you:

Help called:
Options:
-p = path to folder containing folders of phylosift results
-x = folder suffix
-r = folder prefix
-h = This help message

Example

I have an input folder that looks like this:

phylosift_results (folder/directory)
├── S1_bin1_out (folder/directory)
│   ├── S1_bin1.concat.jnlp
│   ├── S1_bin1.html
│   ├── S1_bin1.jplace
│   ├── S1_bin1.xml
│   ├── alignDir
│   ├── blastDir
│   ├── marker_summary.txt
│   ├── run_info.txt
│   ├── sequence_taxa.1.txt
│   ├── sequence_taxa.txt
│   ├── sequence_taxa_summary.1.txt
│   ├── sequence_taxa_summary.txt
│   ├── taxa_90pct_HPD.txt
│   ├── taxasummary.txt
│   └── treeDir
├── S1_bin2_out (folder/directory)
│   ├── S1_bin2.concat.jnlp
│   ├── S1_bin2.html
│   ├── S1_bin2.jplace
│   ├── S1_bin2.xml
│   ├── alignDir
│   ├── blastDir
│   ├── marker_summary.txt
│   ├── run_info.txt
│   ├── sequence_taxa.1.txt
│   ├── sequence_taxa.txt
│   ├── sequence_taxa_summary.1.txt
│   ├── sequence_taxa_summary.txt
│   ├── taxa_90pct_HPD.txt
│   ├── taxasummary.txt
│   └── treeDir
├── etc...

Thus phylosift_results is a folder containing all the phylosift output folders from all my bins; that folder is what you should put for -p.

S1_bin1_out and S1_bin2_out each have _out as a suffix to the bin name. That suffix is what is provided to -x. If you have decided to name your Phylosift output folders with a prefix, such as phylosift_S1_bin1, you can provide phylosift_ to the -r option. You cannot have both a suffix and prefix.

Example command:

phylosift_krona_to_table.pl -p ./phylosift_results -x _out > S1_phylosift_resultstable.txt

This generates a table following some rules. Primarily, I'm interested in using this script to call a taxonomy for a bin, so I am looking for a majority rule. But I don't want to be too sloppy, so there are rules for what means a majority at different taxonomic levels. Multiple majority rule taxa are possible (and separated by ",") if they pass the majority cutoff for the taxonomic level.

Taxonomic Level Majority Cutoff If below cutoff...
NULL NA WARNING: NO ASSIGNMENTS for bin
ROOT 100% WARNING: ROOT is not assigned to 100% for bin
Cellular organism 100% WARNING: Cellular organisms is not assigned to 100% for bin
Domain 90% Unclassified microbe
Phylum 50% Domain_name (mixed)
Class 40% Phylum_name (abund) [multi]
Order 30% Phylum_name (abund), Class_name (abund) [multi]
Family 30% Phylum/Class/Order (abund) [multi]
Genus 25% Phylum/Class/Order/Family (abund) [multi]
Species 25% P/C/O/F/G (abund) [multi]
Subspecies/Genome 25% P/C/O/F/G/S (abund) [multi]
Additional depths 25% P/C/O/F/G/S/Sub or Genome (abund) [multi]

The output for sample Sample 1 looks like:

Bin name Taxonomy (percent relative abundance)
S1_bin1 PROTEOBACTERIA (100) GAMMAPROTEOBACTERIA (43.9969306483449)
S1_bin2 BACTEROIDETES/CHLOROBI GROUP (100) BACTEROIDETES (100) FLAVOBACTERIIA (100) FLAVOBACTERIALES (57.1428571428571) FLAVOBACTERIACEAE (44.092230634174)
S1_bin3 PROTEOBACTERIA (100) ZETAPROTEOBACTERIA (99.89503224484) UNCLASSIFIED ZETAPROTEOBACTERIA (86.8889596220667) ZETA PROTEOBACTERIUM SCGC AB 133 C04 (48.0392969962402),ZETA PROTEOBACTERIUM SCGC AB 137 I08 (36.5954238775949),
S1_bin4 ARCHAEA (mixed)
WARNING: NO ASSIGNMENTS for bin S1_bin5
S1_bin5 BACTERIA (mixed)
S1_bin6 Unclassified microbe
WARNING: Cellular organisms is not assigned to 100% for bin S1_bin7
S1_bin8 PLANCTOMYCETES (50) CANDIDATE DIVISION NC10 (48.0386905602096),PHYCISPHAERAE (50),
S1_bin9 PROTEOBACTERIA (100) GAMMAPROTEOBACTERIA (50),ZETAPROTEOBACTERIA (50),
S1_bin10 PROTEOBACTERIA (75.4290657552175) DELTA/EPSILON SUBDIVISIONS (66.3622131607418) DELTAPROTEOBACTERIA (66.3622131607418) MYXOCOCCALES (49.1954721381515) SORANGIINEAE (28.8032419805045) POLYANGIACEAE (28.8032419805045) SORANGIUM (28.8032419805045)

In this example, bins 2/3/10 are a bit more specific, while bins 1/4/5/6 are not highly resolved by Phylosift. Bin 9 is split 50:50 between Zetaproteobacteria and Gammaproteobacteria. And bins 5/7 seem to have problems. Potentially phylosift didn't run properly on bin 5 (or didn't have any marker genes). And for bin 7, I tend to get this result when there are some viral hits in the Phylosift results (so should take a look at this manually to judge).

Please open an issue if you have any issues or questions. This program is provided without warranty or a guarantee of support. Thanks!!

About

Converts Phylosift KRONA plot outputs for several bins into an easily accessible table with majority taxa calls

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages