-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Number=G in INFO records #272
Comments
This seems wrong to me. If gnomad is the only offender, I think that we can still rescue this.... Let me ask offline. |
I think that we could change the spec to say something like "In the case of Number=G in the INFO field, this is only valid if all the genotypes in that record have the same ploidy, and thus the same number of genotypes." This way we allow genotype annotations (like prior or what have you) in the INFO field, while not limiting to ploidy=2. |
+1 I agree with this. |
Odd. My previous reply didn't show up. My concerns with the proposal are:
I'm starting to think that future revisions of the standard should deprecate support for |
I don't understand the concerns. Knowing the number of alleles, one can determine the ploidy. |
@pd3: Ah-- you'd be fine with allowing any number of entries, provided it represented a valid ploidy given the alleles present in the record. There would be no need to infer the ploidy from samples in that case (in contrast to @yfarjoun's proposal). I'm +1 on this approach, as it is something that I can implement and document in pysam. |
@pd3 I don't understand your comment. you could have 2 alleles in the VCF record, and 3 alleles in the genotype...or vice versa...so how does the number of alleles (in the record) tell you the ploidy? |
If we know the number of ALT alleles, a VCF producer can write INFO/Number=G annotations with the correct number of fields. A VCF reader can determine the ploidy given the number of ALT alleles and the number of fields in the annotation. The silent assumption is that the ploidy is shared across all samples. If the ploidy is mixed, INFO/Number=G is not informative of the samples with different ploidy. |
I still do not understand how you can tell from the number of alleles what the ploidy is. Can you tell what the ploidy is for the following REF/ALT values:
For either of these examples, I could be genotyping a haploid (ploidy =1), diploid (poidy =2) or any other value of ploidy (ploidy=??) so, again, I do not understand what you mean by
|
There are two cases:
|
Right, so you agree that nAls is not enough to determine nGTs, you need to know the ploidy! Since the ploidy is (possibly) sample dependent, the OP is correct that there needs to be a clarification in the spec. |
In the case of The "getter" will always return the values as-is, but may not be able to help with mapping from genotypes (given by a sequence of alleles/allele-indices) to genotype values if the number of values do not match a valid ploidy given the number of alleles. The "setter" will be able to raise an exception of the number of values does not match a valid ploidy, given the number of alleles. |
Just catching up on the thread. Apologies for the confusion - gnomAD can indeed be assumed to be ploidy=2 for all GC fields, even GC_Male on non-PAR-X:
with zeros for all heterozygotes. Not ideal, but at least it's consistent with the raw data of diploid calling for all samples on all chromosomes (the appropriate adjustments are made in the In any case, I appreciate that assuming ploidy is not particularly general and agree that it's not exactly clear in the absence of genotype data, so I'd be happy to add a |
Thanks for your feedback @konradjk. I'll update |
@bioinformed I agree inferring the ploidy from the samples is dangerous (not to mention potentially expensive). However, fixing the ploidy in the header means the ploidy can't vary by variant which might be desirable e.g. for sex chromosomes. What about a PLOIDY INFO field that is used to interpret Number=G there? Personally it doesn't seem like this is buying us much and I prefer deprecating Number=G and treating it like Number=. We will update Hail to reject Number=G in the INFO field for now until the spec is amended with something more precise. |
I agree an INFO field should not have "Number=G". We should deprecate such Number in the spec. Htslib/bcf2 is keeping the local array length. "Number=G" is treated largely the same way as "Number=." anyway. I don't quite like a header Ploidy tag because it doesn't apply to sex chromosomes or more complex scenarios (e.g. encoding CNV-aware genotypes in VCF). A PLOIDY INFO field complicates implementation but doesn't work with chrX, either. I think just deprecating "Number=G" for INFO fields should be fine. |
PS: gnomAD may consider to replace "Number=G" with "Number=3" for INFO fields on the assumption of diploidy. |
@lh3 - the reason it's not 3 is because of multi-allelics (tri- would be 6, etc). @cseed - you probably know this, but these were output using Hail ( |
I agree that we should deprecate Number=G in INFO fields. The specs should be generic enough to be meaningful in all situations - not just fixed ploidy germline analysis. Dealing with anuploidy and highly amplified regions is bad enough in the genotype fields (e.g. try representing phased genotypes of SNVs occurring in sites with 20+ copies), leaking fix-ploidy assumptions into INFO fields will just make correctly encoding variants when doing germline/primary/metastatic multi-sample analysis of tumours with high SV burdens even more horrible than it already is. |
Thinking about this more, I prefer not to deprecate I fully appreciate the concern that such fields may be ill-defined at sites with mixed ploidy. However, I don't like the idea of forbidding tags like |
As an aside, gnomAD will be removing its G-numbered entries from the INFO field for our new release (2.1 and beyond), as we have no evidence they, or any previous broken iterations of them, were ever used and they've become unruly to calculate. |
I would be strongly in favor of forbidding INFO tags with Number=G. No well-known tags use it, the small number of reported use cases had decided to drop it, and as @pd3 mentions "such fields may be ill-defined at sites with mixed ploidy". I don't think it is worth introducing partially broken support just to allow to merge the (extremely infrequent) tags that could need it. |
@cyenyxe Turning the argument around, what is the harm of allowing a feature just because in some extremely rare and hypothetical cases it is not well defined? Dividing by 0 is not defined, yet 0 is an extremely useful concept. |
I think that it would be best to nail down what G in INFO means without having to process all the genotypes (also, what if there are no genotypes?) I would suggest using a default of assuming PLOIDY=2 unless there's an additional INFO field ("PLOIDY") that states otherwise. |
@pd3 - I would suggest this is not rare, e.g. mixed chrX ploidy for trios. |
Somatic variant calling is neither rare nor hypothetical. Technically, this even comes up in germline calling in regions of tandem duplication but I suspect all current germline callers just discard the variants because they can't correctly call AFs of 0.33/0.33/0.33 (het dup) or 0.25/0.25/0.25/0.25 (hom dup) in a 'diploid' genome.
Given VCFv4.3 explicitly allows them, we need to either a) fully define the feature, or b) remove it. Every feature in the spec imposes a technical burden on all compliant implementations. Every feature has to be able justify the cost. I don't believe this feature can justify itself and should be removed.
If we're not going to outlaw them, then we must explicitly define what The definition that makes the most sense to me is to have the INFO ploidy as the maximum ploidy inferred from the FORMAT GT field (since it's required), defaulting to haploid if not FORMAT information is available present. We then need to determine how these map to lower ploidy samples and I see two reasonable approaches: a) an inferred placeholder reference genotype for the higher order chromatids, and b) spreading the 'value' of the field over all possible higher order chromatid possibilities. Unfortunately, depending on the sematics of the field either, the implementer will want to do a) for some fields, and b) for others. We'd have to either disallow b) (since it only works for numeric), or add add yet another metadata field indicating which approach was used for that field. Again, I reiterate: I don't believe this feature can justify the complexity required to properly implement it and it should be removed. While we're at it, we also need to clarify exactly what we mean by a genotype. If a diploid germline sample has a tandem duplication with respect to the reference on one chromatid and has two ALT alleles at that location, what goes in the GT field? How do we report the phasing of this? Defining what happens in these sort of edge cases where there are multiple reasonable possible resolutions is exactly why we need this specifications document and why we need to resolve and ambiguities in it. |
@ctsa @d-cameron Of course I am aware that any multi-sample VCF with chromosome X will likely contain samples of different ploidy and can be frequent in somatic calling. My tongue-in-cheek comment was aimed at all the non-existent tags we are trying to ban. To be clear, the cases discussed above are not what I am suggesting at all. My proposal is: Let's not change the specification and outlaw Number=G INFO tags, instead clarify that they can only refer to one ploidy (one within a record, otherwise flexible, see below). I doubt this will be a hugely useful or widely used feature, but I don't see what harm it can make and why it should be banned. Note that such hypothetical INFO tags need not have any connection to FORMAT, the same way INFO/AA or INFO/DB does not. Importantly, the semantics of Number=G INFO tags cannot be defined globally, every such tag must be allowed to follow its own semantics. Also, we do not have to enforce any assumptions about ploidy as it can be inferred unambiguously from the number of elements and alleles (see comments above in this thread). Note that this is not new and is already the case with FORMAT fields -- imagine PL in the absence of GT, each sample can have different ploidy and must be determined exactly this way. |
Given today's discussion, I'm happy to keep them as long as they are stand-alone fields without any sepc-constrained relationship to the FORMAT ploidy. We'd also need to add some clarification text that the ploidy of the INFO G fields should be inferred from the number of ALT alleles, and the length of the G field. |
I agree. The section 1.6.2 on GL tag already contains a detailed explanation of the number and ordering of the fields. Maybe this could be made into a stand alone section which both places could refer to. |
The cardinality and order of the values stored in a FORMAT record with
Number=G
is well-defined and based on the ploidy of GT -- no problem here. However, http://gnomad.broadinstitute.org/ is now defining INFO records withNumber=G
. This is explicitly allowed by the current VCF 4.2 and 4.3 specifications, but no guidance is given on how to determine cardinality -- just that values are ordered the same as FORMAT/GT. An issue was filed with thepysam
project to clarify the desired semantics for such records (see pysam-developers/pysam#593).In practice, should such INFO entries be treated as
Number=.
? Or assume ploidy is 2 (ick!)? It seems a little late in the game to attempt to outlaw usingNumber=G
for INFO records.The text was updated successfully, but these errors were encountered: