Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VQSR not fully implemented #35

Open
lczech opened this issue Jul 18, 2022 · 4 comments
Open

VQSR not fully implemented #35

lczech opened this issue Jul 18, 2022 · 4 comments

Comments

@lczech
Copy link

lczech commented Jul 18, 2022

It seems that the VQSR (GATK VariantRecalibrator) step here is not fully implemented. That rule only produces the model built with the VariantRecalibrator, but does not use ApplyVQSR to actually perform the filtering.

That means, the output files produced by that rule, all.{vartype}.recalibrated.vcf.gz, are in fact the recal files used by ApplyRecalibration, but not actually the filtered VCFs that we are looking for. Hence, an additional rule needs to be implemented that runs the gatk/applyvqsr wrapper.

If that helps, I've implemented this in my pipeline, which also offers to set all the resource files needed for the VariantRecalibrator to be provided via the config file, so that users don't have to edit the snakemake files.

@dlaehnemann
Copy link

Well spotted and thanks for documenting this in so much detail, with good suggestions. You and anybody using this pipeline is welcome to implement this via a PR, and we will look into implementing this if we can find some time. But we are not actively using this pipeline ourselves, so this might be a while.

For further inspiration, the base quality score recalibration (BQSR) setup might also be a good place to look, especially as this uses some of the respective files available in this workflow:

rule recalibrate_base_qualities:
input:
bam=get_recal_input(),
bai=get_recal_input(bai=True),
ref="resources/genome.fasta",
dict="resources/genome.dict",
known="resources/variation.noiupac.vcf.gz",
known_idx="resources/variation.noiupac.vcf.gz.tbi",
output:
recal_table="results/recal/{sample}-{unit}.grp",
log:
"logs/gatk/bqsr/{sample}-{unit}.log",
params:
extra=get_regions_param() + config["params"]["gatk"]["BaseRecalibrator"],
resources:
mem_mb=1024,
wrapper:
"0.74.0/bio/gatk/baserecalibrator"
rule apply_base_quality_recalibration:
input:
bam=get_recal_input(),
bai=get_recal_input(bai=True),
ref="resources/genome.fasta",
dict="resources/genome.dict",
recal_table="results/recal/{sample}-{unit}.grp",
output:
bam=protected("results/recal/{sample}-{unit}.bam"),
log:
"logs/gatk/apply-bqsr/{sample}-{unit}.log",
params:
extra=get_regions_param(),
resources:
mem_mb=1024,
wrapper:
"0.74.0/bio/gatk/applybqsr"

@lczech
Copy link
Author

lczech commented Jul 20, 2022

Thanks! I am also not actively using this pipeline here, as I'm developing my own that already fixes the issue, as stated above, so I won't have much time to work in this either here :-(

I also noted the BQSR setup has some similarity in GATK - although it seems that the wrappers for that changed. In previous versions, both steps of BQSR were combined into one wrapper, and only split into two later on. That might have been the reason why this was missed in the pipeline here.

@dlaehnemann
Copy link

Sure, that could be a root cause here. Or simply overlooking some details...

Does your workflow above follow the GATK best practices and do you actively maintain it? It might be worth deprecating this workflow in favor of yours, if yours ticks the right boxes...

Also, I realized yours isn't yet listed in the Snakemake Workflow Catalog. That catalog is basically a nightly GitHub crawler that aggregates all snakemake workflows that adhere either to its inclusion criteria. And if you walk the couple of extra meters to enable standardized usage, you even get a quick and easy deployment help for you workflow that you can for example cite in its README.md. As a good example, see the usage info for our dna-seq-varlociraptor workflow.

@lczech
Copy link
Author

lczech commented Jul 28, 2022

Does your workflow above follow the GATK best practices and do you actively maintain it?

As far as I am aware, it does indeed implement the best practices. And yes, at least for a while I plan to maintain it. We are currently working on a publication describing the pipeline as well.

Also, I realized yours isn't yet listed in the Snakemake Workflow Catalog.

Indeed, I'd love to have it listed in the catalog! I'll hopefully find some time to walk these extra meters soon ;-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants