Save disk space by compressing fastq files (after trimming and filtering) #136
I have altered the snakemake rules in DennisSchmitz/jovian@30a5ec0 and DennisSchmitz/jovian@28c1cf1. These changes switch all intermediate fastq files to gzipped variants. I am still running tests to see how this performs compared to the non-gzipped pipeline.
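For reference, this is roughly the shape of the change (a minimal sketch only; the rule below is illustrative, not the exact code from those commits, and the paths are assumptions):

```
rule Clean_the_data:
    input:
        r1="data/raw_fastq/{sample}_R1.fastq.gz",
        r2="data/raw_fastq/{sample}_R2.fastq.gz",
    output:
        # Trimmomatic compresses its output automatically when the
        # target filename ends in .gz
        r1="data/cleaned_fastq/{sample}_pR1.fq.gz",
        u1="data/cleaned_fastq/{sample}_uR1.fq.gz",
        r2="data/cleaned_fastq/{sample}_pR2.fq.gz",
        u2="data/cleaned_fastq/{sample}_uR2.fq.gz",
    shell:
        "trimmomatic PE {input.r1} {input.r2} "
        "{output.r1} {output.u1} {output.r2} {output.u2} "
        "SLIDINGWINDOW:4:20 MINLEN:50"
```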
I have finished the benchmark on 9 bacterial metagenomic datasets. In short, the conclusions are:

Additional remarks:
I just noticed two other rules that also depend on these intermediate fastq files:
Thanks for your thorough analysis and the report you emailed, really nice! So I'm a bit at a loss about how to proceed. Being able to reduce the footprint of certain intermediate files by >50% is really nice, especially since users on our internal servers are now capped by a ROM quota (which I'm way above 👼). But I really don't like that it adds 50 minutes of additional processing time per sample. Maybe instead of making it a flag in the config file, it could be added as a flag to the wrapper? Then end-users can choose for themselves whether they want to compress after an analysis has finished?
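Such a wrapper flag could boil down to a simple post-run compression step. A minimal bash sketch of the idea; the variable name and directories are assumptions, not existing Jovian options:

```
# Hypothetical addition to the Jovian wrapper; COMPRESS_INTERMEDIATES
# is an assumed flag name, and the directories are examples.
if [ "${COMPRESS_INTERMEDIATES}" = "TRUE" ]; then
    # pigz gzips files in parallel across cores
    find data/cleaned_fastq data/HuGo_removal \
        \( -name '*.fq' -o -name '*.fastq' \) -exec pigz {} +
fi
```

This keeps the 50 minutes of compression time out of the analysis itself and makes it strictly opt-in.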
Yes, 50 minutes extra per sample is not a very desirable change. Also, it might be a bit of a hassle to adapt those other two rules to work with compressed files (which again may slow down the whole process). As an alternative, we have suggested not waiting for the `onsuccess` part at the end of the pipeline to remove unnecessary files, but instead using `temp()` statements as rule outputs for rules whose output is no longer needed once it has been processed by the next rule. E.g. after trimming, the trimmed reads only need to be mapped by bowtie2 (background removal part 1); then they can be removed. If the output of the trimming rule is written as

```
output:
    temp("data/cleaned_fastq/{sample}.fastq")
```

then snakemake can automatically remove this file when it is no longer needed. This of course also keeps disk usage lower during processing. Apparently using `temp()` gave some trouble before, especially when the pipeline crashed halfway. I am trying it right now to see if I run into problems.
With large datasets and limited disk capacity, saving intermediate fastq files as raw fastq may take up hundreds of GBs of disk space. Disk usage can be decreased by using gzipped fastq files. The tools that we currently have in the pipeline can all work with gzipped fastq files; bedtools is the exception, but it can be replaced by bbtools reformat.sh, which is already in `scaffold_analyses.yaml`. Also, this supposedly works faster if you also have pigz installed (which is in `Jovian_master_environment.yaml`).

To implement this, the following rules have to be adjusted (a sketch of the kind of change follows after the list):

- `Clean_the_data`: trimmomatic's output
- `QC_clean_data`: FastQC's input
- `HuGo_removal_1`: bowtie2's input
- `HuGo_removal_2/3`: bedtools to bbtools reformat.sh (also requires a new conda env)
- `De_novo_assembly`: SPAdes's input
- `all`: {sample}_{read}.fq.gz?

I am currently testing these and want to create a new branch (from `dev`) when I get them working. I will also try to do a little benchmark to get an idea of the performance of the 'gzipped pipeline' against the current version with raw fastq files. Please let me know if you have any other ideas!
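As an illustration of the per-rule changes, here is a sketch of two of the rules above. The filenames and exact commands are assumptions, not the pipeline's current code; FastQC, bowtie2, and SPAdes read gzipped fastq natively, so most rules only need their filenames switched to `.fq.gz`:

```
# Most rules only need the .fq.gz extension; FastQC reads gzip natively.
rule QC_clean_data:
    input:
        "data/cleaned_fastq/{sample}_pR1.fq.gz"
    output:
        "results/fastqc/{sample}_pR1_fastqc.html"
    shell:
        # assumes results/fastqc already exists
        "fastqc -o results/fastqc {input}"

# Sketch of the bedtools -> bbtools swap: reformat.sh extracts fastq
# from the BAM and gzips its output when the filename ends in .gz.
rule HuGo_removal_2:
    input:
        "data/HuGo_removal/{sample}_unmapped.bam"
    output:
        r1="data/HuGo_removal/{sample}_pR1.fq.gz",
        r2="data/HuGo_removal/{sample}_pR2.fq.gz",
    shell:
        "reformat.sh in={input} out={output.r1} out2={output.r2}"
```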