Number=G in INFO records #272

bioinformed · 2017-12-26T13:37:33Z

The cardinality and order of the values stored in a FORMAT record with Number=G is well-defined and based on the ploidy of GT -- no problem here. However, http://gnomad.broadinstitute.org/ is now defining INFO records with Number=G. This is explicitly allowed by the current VCF 4.2 and 4.3 specifications, but no guidance is given on how to determine cardinality -- just that values are ordered the same as FORMAT/GT. An issue was filed with the pysam project to clarify the desired semantics for such records (see pysam-developers/pysam#593).

In practice, should such INFO entries be treated as Number=.? Or assume ploidy is 2 (ick!)? It seems a little late in the game to attempt to outlaw using Number=G for INFO records.

The text was updated successfully, but these errors were encountered:

yfarjoun · 2017-12-27T14:12:50Z

This seems wrong to me. If gnomad is the only offender, I think that we can still rescue this....

Let me ask offline.

yfarjoun · 2017-12-27T14:17:12Z

I think that we could change the spec to say something like "In the case of Number=G in the INFO field, this is only valid if all the genotypes in that record have the same ploidy, and thus the same number of genotypes."

This way we allow genotype annotations (like prior or what have you) in the INFO field, while not limiting to ploidy=2.

pd3 · 2018-01-02T08:33:47Z

+1 I agree with this.

bioinformed · 2018-01-02T12:14:05Z

Odd. My previous reply didn't show up. My concerns with the proposal are:

N+1 problem: Take a VCF file with N diploid samples and add one with a few hemizygous or triploid loci and now, as a side-effect, the semantics of certain INFO fields now change? This seems like a great recipe for confusion.
Sites-only files. As in the gnomAD case, what ploidy should be assumed?

I'm starting to think that future revisions of the standard should deprecate support for Number=G in INFO, and for backward compatibility, treat Number=G as Number=. in INFO fields. A more out-of-the-box solution would be to add a Ploidy= metadata element to INFO declarations to allow/require setting a fix ploidy.

pd3 · 2018-01-02T12:40:11Z

I don't understand the concerns. Knowing the number of alleles, one can determine the ploidy.

bioinformed · 2018-01-02T12:46:30Z

@pd3: Ah-- you'd be fine with allowing any number of entries, provided it represented a valid ploidy given the alleles present in the record. There would be no need to infer the ploidy from samples in that case (in contrast to @yfarjoun's proposal).

I'm +1 on this approach, as it is something that I can implement and document in pysam.

yfarjoun · 2018-01-02T13:10:38Z

@pd3 I don't understand your comment.

you could have 2 alleles in the VCF record, and 3 alleles in the genotype...or vice versa...so how does the number of alleles (in the record) tell you the ploidy?

pd3 · 2018-01-02T13:20:15Z

If we know the number of ALT alleles, a VCF producer can write INFO/Number=G annotations with the correct number of fields. A VCF reader can determine the ploidy given the number of ALT alleles and the number of fields in the annotation.

The silent assumption is that the ploidy is shared across all samples. If the ploidy is mixed, INFO/Number=G is not informative of the samples with different ploidy.

yfarjoun · 2018-01-02T16:35:59Z

I still do not understand how you can tell from the number of alleles what the ploidy is.

Can you tell what the ploidy is for the following REF/ALT values:

A/C,T (2 alt alleles)
A/C (1 alt alleles)
A/C,G,T (3 alt alleles)

For either of these examples, I could be genotyping a haploid (ploidy =1), diploid (poidy =2) or any other value of ploidy (ploidy=??) so, again, I do not understand what you mean by

If we know the number of ALT alleles, a VCF producer can write INFO/Number=G annotations with the correct number of fields.

pd3 · 2018-01-02T16:47:58Z

There are two cases:

determine the number of fields given the number of alleles and ploidy:
.. for ploidy=1, nGTs = nAls
.. for ploidy=2, nGTs = nAls*(nAls+1)/2
determine the ploidy given the number of alleles and the number of fields:
.. for nGTs=nAls, the ploidy is 1
.. for nGTs=nAls*(nAls+1)/2, the ploidy is 2

yfarjoun · 2018-01-02T16:52:35Z

Right, so you agree that nAls is not enough to determine nGTs, you need to know the ploidy!

Since the ploidy is (possibly) sample dependent, the OP is correct that there needs to be a clarification in the spec.

bioinformed · 2018-01-02T17:13:27Z

In the case of pysam, the accessor code will always know the number of alleles and number of values for the field when setting or getting an INFO field with Number=G and can thus infer the ploidy of the values.

The "getter" will always return the values as-is, but may not be able to help with mapping from genotypes (given by a sequence of alleles/allele-indices) to genotype values if the number of values do not match a valid ploidy given the number of alleles.

The "setter" will be able to raise an exception of the number of values does not match a valid ploidy, given the number of alleles.

konradjk · 2018-01-02T18:46:59Z

Just catching up on the thread. Apologies for the confusion - gnomAD can indeed be assumed to be ploidy=2 for all GC fields, even GC_Male on non-PAR-X:

X 3181970 [...] GC_Male=4146,0,2855

with zeros for all heterozygotes. Not ideal, but at least it's consistent with the raw data of diploid calling for all samples on all chromosomes (the appropriate adjustments are made in the Hemi (male) and Hom (female) tags).

In any case, I appreciate that assuming ploidy is not particularly general and agree that it's not exactly clear in the absence of genotype data, so I'd be happy to add a Ploidy= tag to the header or whatever is decided down the line based on the spec.

bioinformed · 2018-01-02T19:38:36Z

Thanks for your feedback @konradjk. I'll update pysam to return any number of values in this case, but require setters to check that the number of values to match any valid ploidy state for the given alleles.

cseed · 2018-01-02T20:05:25Z

@bioinformed I agree inferring the ploidy from the samples is dangerous (not to mention potentially expensive). However, fixing the ploidy in the header means the ploidy can't vary by variant which might be desirable e.g. for sex chromosomes. What about a PLOIDY INFO field that is used to interpret Number=G there?

Personally it doesn't seem like this is buying us much and I prefer deprecating Number=G and treating it like Number=. We will update Hail to reject Number=G in the INFO field for now until the spec is amended with something more precise.

lh3 · 2018-01-02T20:34:08Z

I agree an INFO field should not have "Number=G". We should deprecate such Number in the spec. Htslib/bcf2 is keeping the local array length. "Number=G" is treated largely the same way as "Number=." anyway.

I don't quite like a header Ploidy tag because it doesn't apply to sex chromosomes or more complex scenarios (e.g. encoding CNV-aware genotypes in VCF). A PLOIDY INFO field complicates implementation but doesn't work with chrX, either. I think just deprecating "Number=G" for INFO fields should be fine.

lh3 · 2018-01-02T20:41:13Z

PS: gnomAD may consider to replace "Number=G" with "Number=3" for INFO fields on the assumption of diploidy.

konradjk · 2018-01-02T20:52:24Z

@lh3 - the reason it's not 3 is because of multi-allelics (tri- would be 6, etc).

@cseed - you probably know this, but these were output using Hail (callstats.GC). It was a much preferable output than ExAC's Het (which was a horrifying G-numbered array with the homozygotes removed, so really: G - A - 1 😧 ), since now it's at least a bit more parse-able. But yes, it assumes a fixed ploidy, which isn't ideal (fine for us, but perhaps not for a spec).

d-cameron · 2018-01-03T01:27:05Z

I agree that we should deprecate Number=G in INFO fields. The specs should be generic enough to be meaningful in all situations - not just fixed ploidy germline analysis. Dealing with anuploidy and highly amplified regions is bad enough in the genotype fields (e.g. try representing phased genotypes of SNVs occurring in sites with 20+ copies), leaking fix-ploidy assumptions into INFO fields will just make correctly encoding variants when doing germline/primary/metastatic multi-sample analysis of tumours with high SV burdens even more horrible than it already is.

pd3 · 2018-02-08T21:06:04Z

Thinking about this more, I prefer not to deprecate Number=G for INFO fields. As @konradjk correctly pointed out, there is no good replacement for this. In theory Number=. could be used instead, but then merging such fields with tools like bcftools merge is impossible.

I fully appreciate the concern that such fields may be ill-defined at sites with mixed ploidy. However, I don't like the idea of forbidding tags like NumberForEveryDiploidGenotype. The semantics of sites with mixed ploidy is not one-size-fits-all type of decision and should be decided on a per-case basis. Currently there is no such tag reserved by the VCF specification.

konradjk · 2018-02-08T22:39:30Z

As an aside, gnomAD will be removing its G-numbered entries from the INFO field for our new release (2.1 and beyond), as we have no evidence they, or any previous broken iterations of them, were ever used and they've become unruly to calculate.

cyenyxe · 2018-05-01T02:11:23Z

I would be strongly in favor of forbidding INFO tags with Number=G. No well-known tags use it, the small number of reported use cases had decided to drop it, and as @pd3 mentions "such fields may be ill-defined at sites with mixed ploidy".

I don't think it is worth introducing partially broken support just to allow to merge the (extremely infrequent) tags that could need it.

pd3 · 2018-05-02T05:33:00Z

@cyenyxe Turning the argument around, what is the harm of allowing a feature just because in some extremely rare and hypothetical cases it is not well defined? Dividing by 0 is not defined, yet 0 is an extremely useful concept.

yfarjoun · 2018-05-02T13:02:52Z

I think that it would be best to nail down what G in INFO means without having to process all the genotypes (also, what if there are no genotypes?)

I would suggest using a default of assuming PLOIDY=2 unless there's an additional INFO field ("PLOIDY") that states otherwise.

ctsa · 2018-05-02T14:17:12Z

@pd3 - I would suggest this is not rare, e.g. mixed chrX ploidy for trios.

d-cameron · 2018-05-02T14:47:36Z

some extremely rare and hypothetical cases it is not well defined

Somatic variant calling is neither rare nor hypothetical. Technically, this even comes up in germline calling in regions of tandem duplication but I suspect all current germline callers just discard the variants because they can't correctly call AFs of 0.33/0.33/0.33 (het dup) or 0.25/0.25/0.25/0.25 (hom dup) in a 'diploid' genome.

what is the harm of allowing a feature

Given VCFv4.3 explicitly allows them, we need to either a) fully define the feature, or b) remove it. Every feature in the spec imposes a technical burden on all compliant implementations. Every feature has to be able justify the cost. I don't believe this feature can justify itself and should be removed.

In practice, should such INFO entries be treated as Number=.? Or assume ploidy is 2 (ick!)? It seems a little late in the game to attempt to outlaw using Number=G for INFO records.

If we're not going to outlaw them, then we must explicitly define what one value for each possible genotype actually means for an INFO field. Since it's a specifications document, we need a definition that actually make sense for every valid VCF file.

The definition that makes the most sense to me is to have the INFO ploidy as the maximum ploidy inferred from the FORMAT GT field (since it's required), defaulting to haploid if not FORMAT information is available present. We then need to determine how these map to lower ploidy samples and I see two reasonable approaches: a) an inferred placeholder reference genotype for the higher order chromatids, and b) spreading the 'value' of the field over all possible higher order chromatid possibilities.

Unfortunately, depending on the sematics of the field either, the implementer will want to do a) for some fields, and b) for others. We'd have to either disallow b) (since it only works for numeric), or add add yet another metadata field indicating which approach was used for that field. Again, I reiterate: I don't believe this feature can justify the complexity required to properly implement it and it should be removed.

While we're at it, we also need to clarify exactly what we mean by a genotype. If a diploid germline sample has a tandem duplication with respect to the reference on one chromatid and has two ALT alleles at that location, what goes in the GT field? How do we report the phasing of this? Defining what happens in these sort of edge cases where there are multiple reasonable possible resolutions is exactly why we need this specifications document and why we need to resolve and ambiguities in it.

pd3 · 2018-05-03T05:40:08Z

@ctsa @d-cameron Of course I am aware that any multi-sample VCF with chromosome X will likely contain samples of different ploidy and can be frequent in somatic calling. My tongue-in-cheek comment was aimed at all the non-existent tags we are trying to ban.

To be clear, the cases discussed above are not what I am suggesting at all. My proposal is: Let's not change the specification and outlaw Number=G INFO tags, instead clarify that they can only refer to one ploidy (one within a record, otherwise flexible, see below). I doubt this will be a hugely useful or widely used feature, but I don't see what harm it can make and why it should be banned.

Note that such hypothetical INFO tags need not have any connection to FORMAT, the same way INFO/AA or INFO/DB does not. Importantly, the semantics of Number=G INFO tags cannot be defined globally, every such tag must be allowed to follow its own semantics.

Also, we do not have to enforce any assumptions about ploidy as it can be inferred unambiguously from the number of elements and alleles (see comments above in this thread). Note that this is not new and is already the case with FORMAT fields -- imagine PL in the absence of GT, each sample can have different ploidy and must be determined exactly this way.

d-cameron · 2018-05-03T15:46:59Z

Given today's discussion, I'm happy to keep them as long as they are stand-alone fields without any sepc-constrained relationship to the FORMAT ploidy. We'd also need to add some clarification text that the ploidy of the INFO G fields should be inferred from the number of ALT alleles, and the length of the G field.

pd3 · 2018-05-03T15:50:57Z

I agree. The section 1.6.2 on GL tag already contains a detailed explanation of the number and ordering of the fields. Maybe this could be made into a stand alone section which both places could refer to.

This was referenced Dec 26, 2017

Parsing INFO fields with "Number=G" fails pysam-developers/pysam#593

Closed

{AH} permit Number=G in INFO records pysam-developers/pysam#594

Merged

bioinformed mentioned this issue Dec 28, 2017

Problematic use of Number=G in INFO records macarthur-lab/gnomad_browser#85

Closed

jkbonfield added the vcf label Jan 2, 2018

cyenyxe mentioned this issue Jan 12, 2018

check ploidy when validating PL EBIvariation/vcf-validator#101

Closed

cyenyxe added this to the VCF v4.3 final revision milestone Feb 8, 2018

pd3 added a commit to pd3/hts-specs that referenced this issue May 4, 2018

Clarify Number=G tags in INFO as discussed in samtools#272

c0b6d46

pd3 added a commit to pd3/hts-specs that referenced this issue May 4, 2018

Clarify Number=G tags in INFO as discussed in samtools#272

14f2123

LiterallyUniqueLogin mentioned this issue Jul 1, 2020

Proposal: Number = P (ploidy) #515

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Number=G in INFO records #272

Number=G in INFO records #272

bioinformed commented Dec 26, 2017

yfarjoun commented Dec 27, 2017

yfarjoun commented Dec 27, 2017

pd3 commented Jan 2, 2018

bioinformed commented Jan 2, 2018

pd3 commented Jan 2, 2018

bioinformed commented Jan 2, 2018

yfarjoun commented Jan 2, 2018

pd3 commented Jan 2, 2018

yfarjoun commented Jan 2, 2018

pd3 commented Jan 2, 2018

yfarjoun commented Jan 2, 2018

bioinformed commented Jan 2, 2018

konradjk commented Jan 2, 2018

bioinformed commented Jan 2, 2018

cseed commented Jan 2, 2018

lh3 commented Jan 2, 2018

lh3 commented Jan 2, 2018

konradjk commented Jan 2, 2018

d-cameron commented Jan 3, 2018

pd3 commented Feb 8, 2018 •

edited

Loading

konradjk commented Feb 8, 2018

cyenyxe commented May 1, 2018

pd3 commented May 2, 2018

yfarjoun commented May 2, 2018

ctsa commented May 2, 2018

d-cameron commented May 2, 2018

pd3 commented May 3, 2018 •

edited

Loading

d-cameron commented May 3, 2018

pd3 commented May 3, 2018

Number=G in INFO records #272

Number=G in INFO records #272

Comments

bioinformed commented Dec 26, 2017

yfarjoun commented Dec 27, 2017

yfarjoun commented Dec 27, 2017

pd3 commented Jan 2, 2018

bioinformed commented Jan 2, 2018

pd3 commented Jan 2, 2018

bioinformed commented Jan 2, 2018

yfarjoun commented Jan 2, 2018

pd3 commented Jan 2, 2018

yfarjoun commented Jan 2, 2018

pd3 commented Jan 2, 2018

yfarjoun commented Jan 2, 2018

bioinformed commented Jan 2, 2018

konradjk commented Jan 2, 2018

bioinformed commented Jan 2, 2018

cseed commented Jan 2, 2018

lh3 commented Jan 2, 2018

lh3 commented Jan 2, 2018

konradjk commented Jan 2, 2018

d-cameron commented Jan 3, 2018

pd3 commented Feb 8, 2018 • edited Loading

konradjk commented Feb 8, 2018

cyenyxe commented May 1, 2018

pd3 commented May 2, 2018

yfarjoun commented May 2, 2018

ctsa commented May 2, 2018

d-cameron commented May 2, 2018

pd3 commented May 3, 2018 • edited Loading

d-cameron commented May 3, 2018

pd3 commented May 3, 2018

pd3 commented Feb 8, 2018 •

edited

Loading

pd3 commented May 3, 2018 •

edited

Loading