A fasta file with all phage-positive contigs? #182

Closed
deminatanja opened this issue Jan 25, 2023 · 16 comments
Assignees
Labels
question Further information is requested

Comments

@deminatanja

Hi,

Is it so that there is currently no final fasta file that would contain all phage-positive contigs sequences from the used tools? I think it used to exist when I was using WtP 1-2 years ago, but can't find it now in the updated version. Are there only individual fasta files from each program available now (found from the raw_data folder)?

Best regards,
Tatiana

@mult1fractal mult1fractal self-assigned this Jan 25, 2023
@mult1fractal mult1fractal added the question Further information is requested label Jan 25, 2023
@mult1fractal
Collaborator

Yes, that is correct. I had to remove it because reviewers said that users would just accept this file as the result without thinking and checking the other outputs.
But you can still extract the contigs of interest yourself:

In the report, go to the "Phage prediction by contig" table and then:

# Filter the Phage prediction by contig table to your liking   
# Click on the CSV-Button (this will download the Phage prediction by contig table)     
# Open your Linux-Terminal     
mkdir contigs_of_interest 
cd  contigs_of_interest  
# Copy the downloaded Phage prediction by contig table to the contig_IDs_of_interest -folder  
# Copy the input_fasta to the contig_IDs_of_interest -folder  
cp WtP_results/your_sample/Input_fasta/your_input_fasta.fa.gz /foo/bar/contigs_of_interest  
# Get contig IDs of interest  
tail -n+2 final_report.utf8.csv | tr -d '"' | cut -f2 -d"," > contig_IDs_of_interest.txt  
# via Docker: use Seqkit to extract contigs of interest of your input fasta-file  
docker run --rm -it -v "$PWD":/input nanozoo/seqkit:0.13.2--cd66104
cd /input
seqkit grep --pattern-file contig_IDs_of_interest.txt your_input_fasta.fa.gz > contigs_of_interest.fa    
# Finally, close the docker with ctrl + d  

@deminatanja
Author

Thanks for the answer!
I have to disagree with the reviewers :) , one could still extract a subset of contigs from that file and it was very handy.
When we are getting the IDs of the contigs of interest, what does ' tail -n+2 final_report.utf8.csv | tr -d '"' | cut -f2 -d"," > contig_IDs_of_interest.txt ' actually do?

@mult1fractal
Collaborator

Well, for users it would be more convenient, but I also understand the reviewers' point of view (without checking the results, papers could be full of false-positive results) 👍🏽

tail -n+2 final_report.utf8.csv | tr -d '"' | cut -f2 -d"," > contig_IDs_of_interest.txt

gives you a list of the contig IDs of interest, and via seqkit you can extract these contigs from your input fasta file.
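
To unpack the one-liner step by step, here is a small sketch that runs it on a made-up table. The sample rows and column names are assumptions for illustration only; the real final_report.utf8.csv will look different, and the pipeline only relies on the contig ID sitting in the second comma-separated field.

```shell
# Work in a throwaway directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical stand-in for the downloaded table (header + 2 rows).
cat > final_report.utf8.csv <<'EOF'
"","contig_id","tool_a","sum_normed"
"1","contig_1","0.90","0.80"
"2","contig_7","0.60","0.55"
EOF

# tail -n+2     : start printing at line 2, i.e. skip the header row
# tr -d '"'     : delete every double-quote character
# cut -f2 -d"," : keep only the 2nd comma-separated field (the contig ID)
tail -n+2 final_report.utf8.csv | tr -d '"' | cut -f2 -d"," > contig_IDs_of_interest.txt

cat contig_IDs_of_interest.txt
```

Note that this simple approach assumes no field contains a comma of its own; if one does, `cut` will split in the wrong place.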

@deminatanja
Author

deminatanja commented Jan 25, 2023

Yeah, I understand the point, but what are the "contigs of interest" in this case? What are the criteria that define them?

Would there be a simple way to have a subset of contigs that e.g.

  1. were predicted by all the used tools,
  2. have at least 1 viral gene,
  3. 10 kbp long,
  4. or can be less than 10 kbp if >50% complete.

Previously, I used the quality summary table to select the contig IDs that met these criteria and then, knowing the needed IDs, extracted the corresponding fasta sequences from the common fasta output file. I did this half-manually, so I am a bit lost with the command line now.

@mult1fractal
Collaborator

mult1fractal commented Jan 25, 2023

ahhh ok 👍🏽 got it now, sorry

So the contigs of interest depend on you (based on the other outputs).
You filter the table (phage prediction by contig) to your needs, e.g. prediction values >0.7, download the filtered table and execute the code I provided.

Of course, you can also do that with the CheckV table:

have at least 1 viral gene,
10 kbp long,
or can be less than 10 kbp if >50% complete.

but then you need to parse the downloaded and filtered CheckV table yourself to extract the contigs and sequences of interest (via seqkit).
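
As a rough sketch of what such parsing could look like, the snippet below applies the three CheckV-related criteria with `awk`. It assumes a tab-separated table whose header uses the column names `contig_id`, `contig_length`, `viral_genes` and `completeness` (as in CheckV's quality_summary.tsv); the table downloaded from the report may be laid out differently, and the file names and rows here are made up.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical excerpt of a CheckV-style table (tab-separated).
printf 'contig_id\tcontig_length\tviral_genes\tcompleteness\n' > checkv_table.tsv
printf 'contig_1\t15000\t3\t40.5\n' >> checkv_table.tsv
printf 'contig_2\t8000\t2\t62.0\n'  >> checkv_table.tsv
printf 'contig_3\t8000\t2\tNA\n'    >> checkv_table.tsv
printf 'contig_4\t20000\t0\t90.0\n' >> checkv_table.tsv

# Keep contigs with >= 1 viral gene that are either >= 10 kbp long
# or more than 50% complete. "NA" is treated as 0 by the "+ 0" coercion.
awk -F'\t' '
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }  # map header names to column numbers
  $(col["viral_genes"]) + 0 >= 1 &&
  ($(col["contig_length"]) + 0 >= 10000 || $(col["completeness"]) + 0 > 50) {
    print $(col["contig_id"])
  }
' checkv_table.tsv > checkv_contig_IDs.txt

cat checkv_contig_IDs.txt
```

The resulting ID list can then be passed to `seqkit grep --pattern-file` in the same way as contig_IDs_of_interest.txt. The "predicted by all the used tools" criterion is not covered here, since it depends on the prediction table rather than on CheckV.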

@mult1fractal
Collaborator

To sum up:
Unfortunately, it needs to be done manually by the user because I can't predict what users want or how they will filter.

If you want, you can upload the downloaded, filtered table and I can put together the command line, so you have a list of contig IDs that you can extract from your fasta input file.

@deminatanja
Author

Thanks! I think I got it. I was missing the fact that one can filter prediction values in the table online, sorry, now I see that ;)

What actually are the F1 scores by Ho et al.? To filter the contigs predicted by all used tools, would you recommend using these F1 scores for high-confidence predictions or e.g. 0.7 in the sum_normed column?

@mult1fractal
Collaborator

Oki

Ho et al. benchmarked the tools we use in WtP (we can only use benchmarked tools as they were "tested" and accepted by reviewers 👍🏽 ). The F1 score is defined as the harmonic mean of precision and recall (check here for a better explanation).

As I understand it, it tells you how "reliable/trustworthy" the phage prediction tools being used are.
The F1 scores have nothing to do with the prediction values that the tools generate.
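
For reference, the harmonic-mean definition mentioned above can be written out as follows, with precision $P$ and recall $R$ computed from the true positives (TP), false positives (FP) and false negatives (FN) of a benchmark. This is the general textbook formula, not something specific to Ho et al.:

```latex
F_1 = \frac{2\,P\,R}{P + R},
\qquad P = \frac{TP}{TP + FP},
\qquad R = \frac{TP}{TP + FN}
```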

@deminatanja
Author

Thanks for the explanation! Would you consider prediction values > 0.7 as "positive"? What was the threshold for generating all phage-positive contigs file in the previous version or was it just all above 0?

@mult1fractal
Collaborator

At the time I set a filter (or the user was able to define a filter and set a value) above 0.5 and the contigs that were above this value were collected in the phage positive contig file.

Today I would recommend the following:

  1. filter the phage prediction by contig table (last column > 0.75)
  2. then check the CheckV output table for completeness and other phage indicators
  3. check the chromomap HTML (it shows which phage genes were found on the contig; this is not in the final-result HTML)
  4. extract the contigs of interest
  5. further validate these contigs with other methods
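
Step 1 can also be done on the command line instead of in the report. The sketch below is only an illustration: the file name prediction_by_contig.csv, the column names and the rows are made up, and it assumes the contig ID is in the second field and the combined score in the last one, as in the one-liner earlier in this thread.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical unfiltered table; the last column plays the role of the combined score.
cat > prediction_by_contig.csv <<'EOF'
"","contig_id","tool_a","sum_normed"
"1","contig_1","0.90","0.80"
"2","contig_7","0.60","0.55"
"3","contig_9","0.95","0.76"
EOF

# Strip the quotes, skip the header, keep rows whose last field is > 0.75,
# and print the contig ID from field 2.
tr -d '"' < prediction_by_contig.csv \
  | awk -F',' 'NR > 1 && ($NF + 0) > 0.75 { print $2 }' \
  > contig_IDs_of_interest.txt

cat contig_IDs_of_interest.txt
```

This only covers the score threshold; the CheckV and chromomap checks in steps 2 and 3 and the validation in step 5 still need a human look.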

@deminatanja
Author

deminatanja commented Jan 26, 2023

Thanks a lot for all the help!

I have managed to make a fasta file with all positive contigs (p > 0.75), following your instructions, but using seqtk in the end, as I had that already installed.
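
For anyone with neither seqkit nor seqtk at hand, the extraction step can be approximated with plain `awk`. This is a minimal sketch on made-up data, assuming an uncompressed FASTA file and an ID list with one ID per line that matches the first word of each header; the file names are placeholders.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical input FASTA and ID list.
cat > input.fa <<'EOF'
>contig_1 length=12
ACGTACGT
ACGT
>contig_7
GGGGCCCC
>contig_9
TTTTAAAA
EOF
printf 'contig_1\ncontig_9\n' > contig_IDs_of_interest.txt

# First file: remember the wanted IDs.
# Second file: at each header decide whether to keep the record,
# then print every line while "keep" is true.
awk '
  NR == FNR { wanted[$1] = 1; next }
  /^>/      { id = substr($1, 2); keep = (id in wanted) }
  keep
' contig_IDs_of_interest.txt input.fa > contigs_of_interest.fa

# With seqtk installed, the equivalent call would be:
#   seqtk subseq input.fa contig_IDs_of_interest.txt > contigs_of_interest.fa
cat contigs_of_interest.fa
```

A gzipped input such as your_input_fasta.fa.gz would need to be decompressed first (for example with `gzip -dc`) before `awk` can read it.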

I think there is a typo in lines

# Copy the downloaded Phage prediction by contig table to the contig_IDs_of_interest -folder  
# Copy the input_fasta to the contig_IDs_of_interest -folder 

should be contigs_of_interest -folder?

I will further explore the set I have to extract the contigs based on the criteria I mentioned above.

Btw, the chromomap file has never opened nicely for me; it has been impossible to scroll. It might just be too large, with hundreds of thousands of contigs.

@mult1fractal
Collaborator

It's just an example name for the folder where you run the commands and then use seqkit.
I named it contig_IDs_of_interest; how you name it is up to you 👍🏽

okay Thanks
I will add this to my fixing list

@deminatanja
Author

Yeah, of course, it can be named whatever, just in this example script

mkdir contigs_of_interest 
cd  contigs_of_interest  
# Copy the downloaded Phage prediction by contig table to the contig_IDs_of_interest -folder  
# Copy the input_fasta to the contig_IDs_of_interest -folder  
cp WtP_results/your_sample/Input_fasta/your_input_fasta.fa.gz /foo/bar/contigs_of_interest  

the folder is originally called contigs_of_interest, which is also the name used in the actual cp command, but it has a slightly different name in the preceding comments; that's why it caught my eye :)

@mult1fractal
Collaborator

ahh now I got it. thanks for clarification. now it should be correct 👍🏽

@deminatanja
Author

No problem! Thanks for all the efforts, WtP is great! ;)

@mult1fractal
Collaborator

Thanks for using the tool :)

2 participants