Update CADD to version 1.7 #513

Open
jemten opened this issue Feb 6, 2024 · 2 comments
@jemten
Collaborator

jemten commented Feb 6, 2024

CADD version 1.7 has been released. Among other updates, the scoring now also uses information from protein language models. See paper here

Pre-computed scores can be found here
https://cadd.bihealth.org/download

@jemten jemten added this to the Release 1.3.0 milestone Feb 6, 2024
@fa2k
Contributor

fa2k commented Aug 14, 2024

I've done some work on trying to package CADD v1.7 and have not succeeded. I'd just like to post the information here, in case it may be helpful.

The approach with CADD v1.6.post1 was to run the documented command to download the conda environments into the docker container (https://github.com/BioContainers/containers/blob/60ba043b6e419b33b385d9cc4f22375a69890d84/cadd-scripts-with-envs/1.6.post1/Dockerfile#L45). In version 1.7 it didn't download all of the necessary environments, so I couldn't use the same approach.

Recently, CADD released v1.7.1. CADD now supports using singularity images for the snakemake pipeline instead of conda environments. The release also fixes some other bugs and adds a docker image with the singularity images conda environments.

Attempt 1: Based on CADD's docker image: https://github.com/fa2k/BioContainers-fork/blob/cadd-1.7/cadd-scripts-with-envs/1.7.1/Dockerfile - does not successfully load the conda environments because they exist at the wrong path.

Attempt 2: Create conda environments manually in a loop: https://github.com/fa2k/BioContainers-fork/blob/cadd-1.7/cadd-scripts-with-envs/1.7.1/Dockerfile-full

Attempt 2 produces a 24GB docker image that can successfully execute some CADD commands when combined with the modified cadd module here: https://github.com/fa2k/raredisease/blob/caddtest/modules/nf-core/cadd/main.nf

The linked CADD module contains some additional work-arounds.
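For readers who have not opened the linked Dockerfile, creating the environments "in a loop" (Attempt 2) could look roughly like the sketch below. It does not reproduce the linked Dockerfile: the directory names are hypothetical, and the helper only builds the `conda env create` commands (one per environment YAML file) so they can be inspected before anything is run.

```python
from pathlib import Path


def conda_create_commands(env_dir, prefix_root):
    """Build one `conda env create` command per environment YAML file.

    env_dir and prefix_root are hypothetical locations, not paths taken
    from CADD-scripts or from the linked Dockerfile.
    """
    commands = []
    for env_file in sorted(Path(env_dir).glob("*.yml")):
        # Give each environment its own prefix named after the YAML file.
        prefix = Path(prefix_root) / env_file.stem
        commands.append(
            ["conda", "env", "create", "--file", str(env_file), "--prefix", str(prefix)]
        )
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Note that snakemake looks for its conda environments at paths it derives itself, so environments created at arbitrary prefixes may still not be picked up, which is the kind of path problem seen in Attempt 1.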

The current iteration crashes in snakemake rule annotate_regseq on command:


    python /opt/CADD-scripts-1.7.1/src/scripts/lib/tools/regulatorySequence/predictVariants.py \
        --variants /tmp/tmp.UBUhIOYTIu/NA12878_rhocall_vcfanno_filter_0004-scattered_indels.esm.vcf.gz \
        --model data/annotations/GRCh38_v1.7/regseq/Hyperopt400InclNegatives.json \
        --weights data/annotations/GRCh38_v1.7/regseq/Hyperopt400InclNegatives.h5 \
        --reference data/annotations/GRCh38_v1.7/regseq/reference.fa \
        --genome data/annotations/GRCh38_v1.7/regseq/reference.fa.genome \
        --output /tmp/tmp.UBUhIOYTIu/NA12878_rhocall_vcfanno_filter_0004-scattered_indels.regseq.vcf.gz \
        &> /tmp/tmp.UBUhIOYTIu/NA12878_rhocall_vcfanno_filter_0004-scattered_indels.annotate_regseq.log

with error message:

...
vcfpy.exceptions.IncorrectVCFFormat: Ill-formatted line starting with "#CHROM"

The input VCF to this rule is missing the FORMAT column.
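A quick way to confirm this on an intermediate file is to inspect the `#CHROM` header line directly. The sketch below uses only the Python standard library and is a diagnostic aid, not part of CADD or vcfpy; it does not reproduce vcfpy's own parsing, and the function names are made up for illustration. It flags the case where columns follow `INFO` but the first of them is not `FORMAT`.

```python
import gzip

# The eight fixed columns that start every VCF column header line.
FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def check_header_columns(columns):
    """Return a short diagnosis for the fields of a '#CHROM' header line."""
    if columns[:8] != FIXED_COLUMNS:
        return "fixed columns missing or out of order"
    extra = columns[8:]
    if extra and extra[0] != "FORMAT":
        return "sample columns present but FORMAT column missing"
    return "ok"


def diagnose_vcf(path):
    """Locate the '#CHROM' line in a plain or gzip/bgzip VCF and check it."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                return check_header_columns(line.rstrip("\n").split("\t"))
            if not line.startswith("#"):
                break  # reached the records without seeing a column header
    return "no #CHROM header line found"
```

Running `diagnose_vcf()` on the `*.esm.vcf.gz` file passed to `--variants` would show whether the header itself is the problem or whether the file is damaged in some other way.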

I'm about to give up for a while on CADD, because there are too many problems. But I thought it may help to share this progress, and maybe someone has some tips for how to continue trying.

@jemten
Collaborator Author

jemten commented Aug 19, 2024

Thanks for taking the time to test, @fa2k. We'll try to pick this up after the release of the next version.
