Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Negative GQ values after vcf_to_hdf5 #341

Open
grlewis333 opened this issue Sep 22, 2020 · 3 comments
Open

Negative GQ values after vcf_to_hdf5 #341

grlewis333 opened this issue Sep 22, 2020 · 3 comments

Comments

@grlewis333
Copy link

Hi, I was comparing some different VCF extraction packages and have been really impressed by the extraction speed in the h5 output of vcf_to_hdf5. However, I was doing some basic tests just to check outputs were the same from each method and have found some strange behaviour in the scikit-allel output.

Specifically I found that my GQ values >128 appear to have a negative offset...

Here is an example plotting GQ against sample DP and you can see the (orange) scikit-allel result has this strange offset:
image

For example, row 265 of the VCF is:
chr17 1005289 . A G 431133 PASS AC=6;AF=0.823;AN=6;DP=109787;FS=0;MQ=249.96;MQRankSum=5.615;QD=3.94;ReadPosRankSum=3.074;SOR=0.7 GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP 1/1:0,53:1:53:156:PASS:.:.:244,159,0:206,156,0 1/1:0,40:1:40:117:PASS:.:.:205,120,0:167.3,117.3,0 1/1:0,50:1:50:147:PASS:.:.:235,150,0:197.4,147.4,0

You can see that the correct GQ for sample 1 is 156, but the value I get through scikit-allel is -100:

allel.vcf_to_hdf5('fpath.vcf.gz', 'fpath.h5', fields='*', overwrite=True)
callset = h5py.File(fpath.h5,mode='r')
GQ = callset['calldata/GQ']
GQ_0 = GQs[:,0]
print(GQ_0[265])

>>> -100

I found 271 cases where the GQ is offset and in each cases it is offset by either 28 (256) or 211 (2048) which can't be a coincidence.

Any idea what might be causing this issue?

@hardingnj
Copy link
Collaborator

Looks like integer overflow.
GQ is limited to 99, so the default dtype will reflect this.
(though at time of writing can't find a source for this).

You can pass a dtype dict to the function if you have a good reason for GQ >= 128.

@grlewis333
Copy link
Author

Ah thanks very much! I just added types={'calldata/GQ': 'i2'} as a parameter in the vcf_to_hdf5 call and now it works as expected.

Cheers for the fast response!

@hardingnj
Copy link
Collaborator

hardingnj commented Sep 23, 2020

Cool- nice work on the comparisons by the way- useful to have this type of thing out there. Please drop a note to the google group if you write this up into a blog post/similar.

I might reopen this- just to decide whether it's worth having an explicit check on the max value of GQ, otherwise could catch people out. Not an easy thing to trace.

What is the source of your VCF? If it's something mainstream producing GQs > 99 we should look at supporting this.

NB:

'calldata/GQ': 'i1',

Is the relevant line that defaults to i1.

@hardingnj hardingnj reopened this Sep 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants