Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Number=G in INFO records #272

Open
bioinformed opened this issue Dec 26, 2017 · 29 comments
Open

Number=G in INFO records #272

bioinformed opened this issue Dec 26, 2017 · 29 comments

Comments

@bioinformed
Copy link

The cardinality and order of the values stored in a FORMAT record with Number=G is well-defined and based on the ploidy of GT -- no problem here. However, http://gnomad.broadinstitute.org/ is now defining INFO records with Number=G. This is explicitly allowed by the current VCF 4.2 and 4.3 specifications, but no guidance is given on how to determine cardinality -- just that values are ordered the same as FORMAT/GT. An issue was filed with the pysam project to clarify the desired semantics for such records (see pysam-developers/pysam#593).

In practice, should such INFO entries be treated as Number=.? Or assume ploidy is 2 (ick!)? It seems a little late in the game to attempt to outlaw using Number=G for INFO records.

@yfarjoun
Copy link
Contributor

This seems wrong to me. If gnomad is the only offender, I think that we can still rescue this....

Let me ask offline.

@yfarjoun
Copy link
Contributor

I think that we could change the spec to say something like "In the case of Number=G in the INFO field, this is only valid if all the genotypes in that record have the same ploidy, and thus the same number of genotypes."

This way we allow genotype annotations (like prior or what have you) in the INFO field, while not limiting to ploidy=2.

@pd3
Copy link
Member

pd3 commented Jan 2, 2018

+1 I agree with this.

@jkbonfield jkbonfield added the vcf label Jan 2, 2018
@bioinformed
Copy link
Author

Odd. My previous reply didn't show up. My concerns with the proposal are:

  1. N+1 problem: Take a VCF file with N diploid samples and add one with a few hemizygous or triploid loci and now, as a side-effect, the semantics of certain INFO fields now change? This seems like a great recipe for confusion.
  2. Sites-only files. As in the gnomAD case, what ploidy should be assumed?

I'm starting to think that future revisions of the standard should deprecate support for Number=G in INFO, and for backward compatibility, treat Number=G as Number=. in INFO fields. A more out-of-the-box solution would be to add a Ploidy= metadata element to INFO declarations to allow/require setting a fix ploidy.

@pd3
Copy link
Member

pd3 commented Jan 2, 2018

I don't understand the concerns. Knowing the number of alleles, one can determine the ploidy.

@bioinformed
Copy link
Author

@pd3: Ah-- you'd be fine with allowing any number of entries, provided it represented a valid ploidy given the alleles present in the record. There would be no need to infer the ploidy from samples in that case (in contrast to @yfarjoun's proposal).

I'm +1 on this approach, as it is something that I can implement and document in pysam.

@yfarjoun
Copy link
Contributor

yfarjoun commented Jan 2, 2018

@pd3 I don't understand your comment.

you could have 2 alleles in the VCF record, and 3 alleles in the genotype...or vice versa...so how does the number of alleles (in the record) tell you the ploidy?

@pd3
Copy link
Member

pd3 commented Jan 2, 2018

If we know the number of ALT alleles, a VCF producer can write INFO/Number=G annotations with the correct number of fields. A VCF reader can determine the ploidy given the number of ALT alleles and the number of fields in the annotation.

The silent assumption is that the ploidy is shared across all samples. If the ploidy is mixed, INFO/Number=G is not informative of the samples with different ploidy.

@yfarjoun
Copy link
Contributor

yfarjoun commented Jan 2, 2018

I still do not understand how you can tell from the number of alleles what the ploidy is.

Can you tell what the ploidy is for the following REF/ALT values:

  • A/C,T (2 alt alleles)
  • A/C (1 alt alleles)
  • A/C,G,T (3 alt alleles)

For either of these examples, I could be genotyping a haploid (ploidy =1), diploid (poidy =2) or any other value of ploidy (ploidy=??) so, again, I do not understand what you mean by

If we know the number of ALT alleles, a VCF producer can write INFO/Number=G annotations with the correct number of fields.

@pd3
Copy link
Member

pd3 commented Jan 2, 2018

There are two cases:

  1. determine the number of fields given the number of alleles and ploidy:
    .. for ploidy=1, nGTs = nAls
    .. for ploidy=2, nGTs = nAls*(nAls+1)/2

  2. determine the ploidy given the number of alleles and the number of fields:
    .. for nGTs=nAls, the ploidy is 1
    .. for nGTs=nAls*(nAls+1)/2, the ploidy is 2

@yfarjoun
Copy link
Contributor

yfarjoun commented Jan 2, 2018

Right, so you agree that nAls is not enough to determine nGTs, you need to know the ploidy!

Since the ploidy is (possibly) sample dependent, the OP is correct that there needs to be a clarification in the spec.

@bioinformed
Copy link
Author

In the case of pysam, the accessor code will always know the number of alleles and number of values for the field when setting or getting an INFO field with Number=G and can thus infer the ploidy of the values.

The "getter" will always return the values as-is, but may not be able to help with mapping from genotypes (given by a sequence of alleles/allele-indices) to genotype values if the number of values do not match a valid ploidy given the number of alleles.

The "setter" will be able to raise an exception of the number of values does not match a valid ploidy, given the number of alleles.

@konradjk
Copy link

konradjk commented Jan 2, 2018

Just catching up on the thread. Apologies for the confusion - gnomAD can indeed be assumed to be ploidy=2 for all GC fields, even GC_Male on non-PAR-X:

X 3181970 [...] GC_Male=4146,0,2855

with zeros for all heterozygotes. Not ideal, but at least it's consistent with the raw data of diploid calling for all samples on all chromosomes (the appropriate adjustments are made in the Hemi (male) and Hom (female) tags).

In any case, I appreciate that assuming ploidy is not particularly general and agree that it's not exactly clear in the absence of genotype data, so I'd be happy to add a Ploidy= tag to the header or whatever is decided down the line based on the spec.

@bioinformed
Copy link
Author

Thanks for your feedback @konradjk. I'll update pysam to return any number of values in this case, but require setters to check that the number of values to match any valid ploidy state for the given alleles.

@cseed
Copy link

cseed commented Jan 2, 2018

@bioinformed I agree inferring the ploidy from the samples is dangerous (not to mention potentially expensive). However, fixing the ploidy in the header means the ploidy can't vary by variant which might be desirable e.g. for sex chromosomes. What about a PLOIDY INFO field that is used to interpret Number=G there?

Personally it doesn't seem like this is buying us much and I prefer deprecating Number=G and treating it like Number=. We will update Hail to reject Number=G in the INFO field for now until the spec is amended with something more precise.

@lh3
Copy link
Member

lh3 commented Jan 2, 2018

I agree an INFO field should not have "Number=G". We should deprecate such Number in the spec. Htslib/bcf2 is keeping the local array length. "Number=G" is treated largely the same way as "Number=." anyway.

I don't quite like a header Ploidy tag because it doesn't apply to sex chromosomes or more complex scenarios (e.g. encoding CNV-aware genotypes in VCF). A PLOIDY INFO field complicates implementation but doesn't work with chrX, either. I think just deprecating "Number=G" for INFO fields should be fine.

@lh3
Copy link
Member

lh3 commented Jan 2, 2018

PS: gnomAD may consider to replace "Number=G" with "Number=3" for INFO fields on the assumption of diploidy.

@konradjk
Copy link

konradjk commented Jan 2, 2018

@lh3 - the reason it's not 3 is because of multi-allelics (tri- would be 6, etc).

@cseed - you probably know this, but these were output using Hail (callstats.GC). It was a much preferable output than ExAC's Het (which was a horrifying G-numbered array with the homozygotes removed, so really: G - A - 1 😧 ), since now it's at least a bit more parse-able. But yes, it assumes a fixed ploidy, which isn't ideal (fine for us, but perhaps not for a spec).

@d-cameron
Copy link
Contributor

I agree that we should deprecate Number=G in INFO fields. The specs should be generic enough to be meaningful in all situations - not just fixed ploidy germline analysis. Dealing with anuploidy and highly amplified regions is bad enough in the genotype fields (e.g. try representing phased genotypes of SNVs occurring in sites with 20+ copies), leaking fix-ploidy assumptions into INFO fields will just make correctly encoding variants when doing germline/primary/metastatic multi-sample analysis of tumours with high SV burdens even more horrible than it already is.

@pd3
Copy link
Member

pd3 commented Feb 8, 2018

Thinking about this more, I prefer not to deprecate Number=G for INFO fields. As @konradjk correctly pointed out, there is no good replacement for this. In theory Number=. could be used instead, but then merging such fields with tools like bcftools merge is impossible.

I fully appreciate the concern that such fields may be ill-defined at sites with mixed ploidy. However, I don't like the idea of forbidding tags like NumberForEveryDiploidGenotype. The semantics of sites with mixed ploidy is not one-size-fits-all type of decision and should be decided on a per-case basis. Currently there is no such tag reserved by the VCF specification.

@konradjk
Copy link

konradjk commented Feb 8, 2018

As an aside, gnomAD will be removing its G-numbered entries from the INFO field for our new release (2.1 and beyond), as we have no evidence they, or any previous broken iterations of them, were ever used and they've become unruly to calculate.

@cyenyxe
Copy link
Member

cyenyxe commented May 1, 2018

I would be strongly in favor of forbidding INFO tags with Number=G. No well-known tags use it, the small number of reported use cases had decided to drop it, and as @pd3 mentions "such fields may be ill-defined at sites with mixed ploidy".

I don't think it is worth introducing partially broken support just to allow to merge the (extremely infrequent) tags that could need it.

@pd3
Copy link
Member

pd3 commented May 2, 2018

@cyenyxe Turning the argument around, what is the harm of allowing a feature just because in some extremely rare and hypothetical cases it is not well defined? Dividing by 0 is not defined, yet 0 is an extremely useful concept.

@yfarjoun
Copy link
Contributor

yfarjoun commented May 2, 2018

I think that it would be best to nail down what G in INFO means without having to process all the genotypes (also, what if there are no genotypes?)

I would suggest using a default of assuming PLOIDY=2 unless there's an additional INFO field ("PLOIDY") that states otherwise.

@ctsa
Copy link

ctsa commented May 2, 2018

@pd3 - I would suggest this is not rare, e.g. mixed chrX ploidy for trios.

@d-cameron
Copy link
Contributor

some extremely rare and hypothetical cases it is not well defined

Somatic variant calling is neither rare nor hypothetical. Technically, this even comes up in germline calling in regions of tandem duplication but I suspect all current germline callers just discard the variants because they can't correctly call AFs of 0.33/0.33/0.33 (het dup) or 0.25/0.25/0.25/0.25 (hom dup) in a 'diploid' genome.

what is the harm of allowing a feature

Given VCFv4.3 explicitly allows them, we need to either a) fully define the feature, or b) remove it. Every feature in the spec imposes a technical burden on all compliant implementations. Every feature has to be able justify the cost. I don't believe this feature can justify itself and should be removed.

In practice, should such INFO entries be treated as Number=.? Or assume ploidy is 2 (ick!)? It seems a little late in the game to attempt to outlaw using Number=G for INFO records.

If we're not going to outlaw them, then we must explicitly define what one value for each possible genotype actually means for an INFO field. Since it's a specifications document, we need a definition that actually make sense for every valid VCF file.

The definition that makes the most sense to me is to have the INFO ploidy as the maximum ploidy inferred from the FORMAT GT field (since it's required), defaulting to haploid if not FORMAT information is available present. We then need to determine how these map to lower ploidy samples and I see two reasonable approaches: a) an inferred placeholder reference genotype for the higher order chromatids, and b) spreading the 'value' of the field over all possible higher order chromatid possibilities.

Unfortunately, depending on the sematics of the field either, the implementer will want to do a) for some fields, and b) for others. We'd have to either disallow b) (since it only works for numeric), or add add yet another metadata field indicating which approach was used for that field. Again, I reiterate: I don't believe this feature can justify the complexity required to properly implement it and it should be removed.

While we're at it, we also need to clarify exactly what we mean by a genotype. If a diploid germline sample has a tandem duplication with respect to the reference on one chromatid and has two ALT alleles at that location, what goes in the GT field? How do we report the phasing of this? Defining what happens in these sort of edge cases where there are multiple reasonable possible resolutions is exactly why we need this specifications document and why we need to resolve and ambiguities in it.

@pd3
Copy link
Member

pd3 commented May 3, 2018

@ctsa @d-cameron Of course I am aware that any multi-sample VCF with chromosome X will likely contain samples of different ploidy and can be frequent in somatic calling. My tongue-in-cheek comment was aimed at all the non-existent tags we are trying to ban.

To be clear, the cases discussed above are not what I am suggesting at all. My proposal is: Let's not change the specification and outlaw Number=G INFO tags, instead clarify that they can only refer to one ploidy (one within a record, otherwise flexible, see below). I doubt this will be a hugely useful or widely used feature, but I don't see what harm it can make and why it should be banned.

Note that such hypothetical INFO tags need not have any connection to FORMAT, the same way INFO/AA or INFO/DB does not. Importantly, the semantics of Number=G INFO tags cannot be defined globally, every such tag must be allowed to follow its own semantics.

Also, we do not have to enforce any assumptions about ploidy as it can be inferred unambiguously from the number of elements and alleles (see comments above in this thread). Note that this is not new and is already the case with FORMAT fields -- imagine PL in the absence of GT, each sample can have different ploidy and must be determined exactly this way.

@d-cameron
Copy link
Contributor

Given today's discussion, I'm happy to keep them as long as they are stand-alone fields without any sepc-constrained relationship to the FORMAT ploidy. We'd also need to add some clarification text that the ploidy of the INFO G fields should be inferred from the number of ALT alleles, and the length of the G field.

@pd3
Copy link
Member

pd3 commented May 3, 2018

I agree. The section 1.6.2 on GL tag already contains a detailed explanation of the number and ordering of the fields. Maybe this could be made into a stand alone section which both places could refer to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

10 participants