Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Existing VCF spec is incompatible with certain hg38 contig names #258

Closed
droazen opened this issue Oct 25, 2017 · 25 comments
Closed

Existing VCF spec is incompatible with certain hg38 contig names #258

droazen opened this issue Oct 25, 2017 · 25 comments
Assignees

Comments

@droazen
Copy link

droazen commented Oct 25, 2017

In hg38 sequence dictionaries we frequently encounter contigs such as this one:

@SQ	SN:HLA-A*01:01:01:01	LN:3503	M5:01cd0df602495b044b2c214d69a60aa2	AS:38	UR:/seq/references/Homo_sapiens_assembly38/v0/Homo_sapiens_assembly38.fasta	SP:Homo sapiens

These contigs include the : character in their names, which is problematic for VCF since section 1.6.1 of the current spec specifically forbids colons in contig names:

The colon symbol (:) must be absent from all
chromosome names to avoid parsing errors when dealing with breakends.

This is inconsistent with the SAM spec, which allows colons in contig names (they are required to match the regular expression \*|[!-()+-<>-~][!-~]*).

With increasing adoption of hg38 (particularly in the last year or so), this is becoming a bit of a crisis. The maintainers of the VCF spec met recently and agreed that this issue needs to be resolved in the next version of the spec (in addition to agreeing on an officially-sanctioned practical workaround that can be used with the current version of the spec).

I see a couple of possible options here:

  1. Remove the ban on colons in contig names in the next version of the spec (VCF 4.4). This would resolve the current incompatibility between the VCF and SAM specs, and since it constitutes a relaxation of an existing restriction rather than imposing a new restriction, it should not cause backwards-compatibility issues.

  2. Keep the restriction, but require that colons in contig names be escaped within VCF records, and automatically encoded/decoded by parsers (rather than escaped manually by users, which would be impractical). This option is consistent with section 1.2 of the existing spec:

Characters with special meaning (such as field delimiters ’;’ in INFO or ’:’ FORMAT fields) 
must be represented using the capitalized percent encoding:
%3A : (colon)

I think whatever approach we choose, tools will have to be updated to accept intervals such as "HLA-A*01:01:01:01:1-5", and use the sequence dictionary to resolve the parsing ambiguity introduced by the colons. Requiring users to escape these contig names manually seems far too onerous.

Thoughts on this issue? I'm hoping to trigger a discussion that will lead to a consensus + resolution in time for VCF 4.4, plus agreement on a workaround that can be used within the existing spec.

@yfarjoun
Copy link
Contributor

So, the issue with the colon (AFAIK) is that people like to be able to write

contig:start-end

and that it would be parsed in the logical way...if you are allowed to have colons and hyphens in the contig names...then the parsing fails...obviously.

I am not sure I understand how allowing colons formally helps this.

If we do immediately encode the colon and hypen (which I would argue also needs protection) using % encoding, do we then have to use that in the command-line when we want to refer to contig:with:hyphens:50-200? (meaning either the entire contig called contig:with:hyphens:50-200 or positions 50 through 200 of the other contig contig:with:hyphens) if not, how will the parsing work?

@droazen
Copy link
Author

droazen commented Oct 25, 2017

@yfarjoun The idea is that tools could be updated to use the sequence dictionary when parsing intervals of the form "contig:start-end". This would resolve the parsing ambiguity while not imposing requirements on the user to escape these contig names manually.

@yfarjoun
Copy link
Contributor

I don't see how this would happen... I have a dictionary that contains the following two contigs:

chr1:10-20
chr1

and I run GATK as so:

java -jar GATK.jar PrintReads -L chr1:10-20 ....

What will happen?

@droazen
Copy link
Author

droazen commented Oct 25, 2017

In that pathological (and extremely hypothetical) case the tool could throw, or handle it in a way that seems best to the tool author. But using sequence dictionary lookups when parsing intervals would solve the actual problem we have with real hg38 contig names.

Note that this issue of how tools should do their parsing is somewhat orthogonal to the issue with the VCF spec itself. Since making the SAM spec more restrictive is unlikely to happen, it seems better to make the VCF spec more permissive, and handle the parsing issue in the way described above.

@yfarjoun
Copy link
Contributor

I know it was hypothetical, and pathological, but if we make a change to the spec, we should define what happens in the corner-cases as well. For example, we could define in the spec that the first contig in the dictionary that matches a certain regex prescribes the parsing.

@droazen
Copy link
Author

droazen commented Oct 25, 2017

Parsing and representation of intervals is outside of the scope of the VCF spec. It is up to each toolkit to decide how to represent and parse intervals, including use of : as a delimiter in interval expressions.

For the VCF spec itself, which is intended to be the focus of this ticket, the issue is the current inconsistency in allowed characters in contig names vs. the SAM spec. Since hg38 contig names are unfortunately here to stay, and since it is almost impossible to make breaking changes to the SAM spec (unless BAML/BAM2 actually happens, of course!), the most pragmatic approach to resolving the inconsistency seems to me to be to relax the VCF spec. The : character does not cause parsing problems for the contig field within VCF records themselves, and the parsing problems it does cause for other formats (such as GATK/Picard interval files) can be managed/worked around.

@pd3
Copy link
Member

pd3 commented Oct 25, 2017

The dictionary of contigs is optional in VCF, better if the solution did not depend on it. As mentioned above, including colons in chromosome names is not a problem for the format itself, only for the tools.
It does not help (and is completely unnecessary) to escape the colons within the VCF files - the rows are tab-delimited and there is no ambiguity. The only problem is what should users type on the command line. This can be completely arbitrary. For bcftools/samtools/htslib users I'd go with the convention of escaping the colons in contigs with backslash, as is the common best practice in unix world.

@d-cameron
Copy link
Contributor

d-cameron commented Oct 25, 2017

As mentioned above, including colons in chromosome names is not a problem for the format itself, only for the tools.

Parsing and representation of intervals is outside of the scope of the VCF spec. It is up to each toolkit to decide how to represent and parse intervals, including use of : as a delimiter in interval expressions.

Breakend notation requires the parsing of intervals using the colon separator. It is not outside the scope of the VCF specifications and is a problem with parsing of the VCF files themselves.

The ambiguity in VCF (and the other tools) can be resolved by always parsing in favour of contig:position or contig:position-position (parsing in favour of the other direction doesn't work in the general case). HLA-A*01:01:01:01 would be parsed as HLA-A*01:01:01:1. If a user wants to use the entire contig of an unfortunately named contig they would have to specify the entire contig interval. In the case of HLA-A*01:01:01:01, they would have to refer to it as HLA-A*01:01:01:01:1-3503 for it to be parsed as intended. It's not difficult for tools to always use the contig:position/contig:position-position notations as unqualified contig is just a shorthand for contig:start-end.

For VCF, a breaking change requiring contig:position notation for all breakends would simplify the parsing as it would no longer be ambiguous.

chr1:10-20
chr1
java -jar GATK.jar PrintReads -L chr1:10-20 ....

I would strongly urge GATK (and other tools) to interpret this as chr1:10-20 as the first contig can be referred to as chr1:10-20:1-11 but there's no way to refer to that interval on chr1 without introducing a new notation.

@yfarjoun
Copy link
Contributor

yfarjoun commented Oct 26, 2017 via email

@jkbonfield
Copy link
Contributor

jkbonfield commented Oct 26, 2017

SAM added the AN field for alternative name. This was largely a way of resolving the chr1 vs Chr1 vs 1 debacle, but actually it also solves the chr1:10-20 vs chr1:10-20 debacle too if we just state that colons are always interpreted as ranges and if you wish to refer to that chromosome name then you need to add an AN field and use that instead.

That said, looking it up in the header to see if the range request matches something already in there is a pragmatic workaround.

@pd3
Copy link
Member

pd3 commented Oct 27, 2017

@d-cameron I forgot about the breakends, good point.

Any lookup mechanism does not address the problem that contigs are optional in VCF. Also does not help when ridiculous contig names are used, as pointed by @yfarjoun. For example imagine two contigs named ctg1 and ctg1:1-10, then ctg1:1-10 is ambigous.

A simple escaping mechanism is needed. Percent encoding is a possibility. Or using a backslash. I personally favor the latter, as it is much simpler to parse. However, I'd accept the percent encoding as well.

@magicDGS
Copy link
Member

I agree with @yfarjoun and @pd3 - the "pathological" case of ctg1, ctg1:1-10 and ctg1:1-10 should be taken into consideration here. I vote for a more stringent contig name in SAM/BAM (and even FASTA, but that format is not in hts-specs) for consistency, and the BAM2 is a good oportunity for making this change. Otherwise, it would be a proliferation of "pathological" cases coming from FASTA files with headers containing a region and not only for hg38.

@tfenne
Copy link
Member

tfenne commented Oct 31, 2017

I would be strongly against making the BAM reference naming more stringent. Prior to hs38DH that would have been a good idea, but there is now a lot of hs38DH data out in the world and I think that ship has already sailed. It's one thing to have a new version of the BAM spec, it's another altogether to make it such that everyone would either have to a) change the reference they are using or b) change all their tooling to use AN to do name resolution.

I agree with the statement above that this is a problem for the tooling, not for the specifications (other than that VCF and BAM should match, and in my opinion by making VCF more lenient). In practice a lot of this can be solved be, e.g. always splitting on the last :, and for contigs that contain colons, requiring an input like HLA-A*01:01:01:01:1- to mean HLA-A*01:01:01:01 from base 1 until the end, rather than using the bare HLA-A*01:01:01:01.

@d-cameron
Copy link
Contributor

Incorporating it in BAM2 is problematic as it is a change to SAM, not BAM. I would vote to resolve all ambiguities in favour of the position qualified interpretation. SAM doesn't care, VCF can be salvaged by fixing a few edge cases. As far as I'm aware, no tool actually write BND notation record without positional qualification so, in practice, the change to VCF is a bit of a non-issue. It's all in how the toolchains interpret colon-containing contigs and this can be resolved without resorting to any contig encoding or escaping.

@pd3 Neither backslash nor percentage encoding are backwards compatible as both are valid contig characters according to the SAM regex. I would very much like to avoid having a third escaping mechanism introduced into the VCF specifications. We're struggling with the two we already have (I'm not aware of any implementations that are fully compliant parsers of the two escaping mechanisms that are already in the specifications).

@pd3
Copy link
Member

pd3 commented Nov 2, 2017

@d-cameron You say what you do not want. But what is your proposal then?

@jkbonfield
Copy link
Contributor

Largely it is a tool chain issue as the tools mix chromosome and region into a single argument. This is historical, but it was luck that we ended up this way. I have seen names like chr10:1000-2000 and that name is chosen because it is the nomenclature for specifying the region, so no matter what nomenclature we used we'd have hit this issue assuming it's a single string.

I favour a workaround strategy, which is a common way to describe "all"; eg for reference ref have ref:, ref:1- or ref:1+ to mean base 1 to end. Then if we have reference chr10 as well as chr10:1000-2000 and we wish to refer to the entirety of the latter, we specify chr10:1000-2000: to disambiguate it. The region parsers typically work from last colon, which would then make this unambiguous, until someone adds a name ending in colon in there but we can ban that now while it's not in active use or just accept we'd need double-colon for that case.

It's better than escape mechanism because they aren't completely backwards compatible.

@pd3
Copy link
Member

pd3 commented Nov 2, 2017

@jkbonfield I hate to point out that this also will not work with insane contig names ending with colon.

The only general mechanism is escaping, even if the specification should introduce a new escape character now and break backwards compatibility.

@jkbonfield
Copy link
Contributor

Oh god, are people already using contig names ending in colon? What's the world: coming to?

Escape characters work provided the escape character also isn't already being used anywhere, or if we simply don't care about backwards compatibility, but that's hard to swallow.

@pd3
Copy link
Member

pd3 commented Nov 2, 2017

People use all sorts of weird stuff. The most recent fun discovery was the pipe character in the contig name.

@droazen
Copy link
Author

droazen commented Nov 2, 2017

This discussion so far has focused on parsing issues in interval representation. I'd like for a moment to bring it back to the CHROM field in VCF. With the current ban on colons in contig names, we cannot (legally) store these hg38 HLA contigs in a VCF. Does anyone object to the specific proposal of allowing colons in the VCF CHROM field for VCF 4.4, so that these contigs can be stored (putting aside the interval parsing issue temporarily)?

@d-cameron
Copy link
Contributor

d-cameron commented Nov 2, 2017

I've just rechecked the VCF specs and s5.4 requires "chr:pos" notation so there isn't a BND parsing ambiguity in the specifications after all. : can be escaped in the INFO and FORMAT fields, escape isn't required in the ##contig header so it's ok there as well. It looks like the only change required for VCF to update the s1.4.7 restriction and remove the sentence from s1.6.1.1. The fact that 1.4.7 and 1.6.1.1 don't match looks like an oversight in the specs.

Edit: There might be an issue if the specs allow alt alleles of the for N<contig><contig> but the specs are unclear on this and I never got an answer when I first asked about that years ago. I've been assuming that such alleles are not valid.

@d-cameron
Copy link
Contributor

@pd3 I've been advocating for the position I outlined in my first contribution to this thread: always parsing and resolving ambiguities in favour of contig:position or contig:position-position. I believe this is the same position as taken by @tfenne.

@pd3
Copy link
Member

pd3 commented Nov 3, 2017

@d-cameron I see, overlooked this, sorry. I could live with requiring explicit range, fine with me. Still I do favor some sort of escaping more, not to the point of blocking your proposal though.

@d-cameron
Copy link
Contributor

As further support for my proposal: I actually had a user encounter the HLA issue with GRIDSS some months back (PapenfussLab/gridss#87). Resolving parsing in favour of contig:position was sufficient for GRIDSS (which uses VCF breakend notation exclusively) to work with hg38 HLA contigs.

@jmarshall
Copy link
Member

Colons in contig names cause problems for tools' user interfaces, but I think those problems can be overcome satisfactorily with simple heuristics based on the contig name dictionary: see my long comment on #124.

What remaining parsing problems are there in parsing VCF format with colons in contig names?

The spec says

excluding the characters <>[]:* to avoid clashes with symbolic alleles and breakend notation

However as @d-cameron noted the breakend notation is still unambiguous as the :pos part is always present. While a contig name containing colons means you can't use the simple left-to-right parsing of this, it's not much harder to read to the final bracket and then back up to the rightmost colon.

Which other VCF/BCF fields would colon-containing contig names need to be escaped in? It seems to me that relaxing the ban on colons wouldn't cause too many problem for the format itself.

jmarshall added a commit to jmarshall/hts-specs that referenced this issue Jan 11, 2019
Disallow \ , "`' ()[]{}<> punctuation characters in reference sequence
names. Commas and angle brackets are used to delimit refnames in other
SAM fields (e.g. SA) and in VCF files, and restricting these other
characters facilitates future delimiter and quoting syntax.

Statistics gathered from various reference sequence archives suggest
that these characters appear vanishingly infrequently in refnames in
existing files in the wild.

Fixes the SAM aspects of samtools#124, samtools#167, samtools#258, and samtools#291.

Add appendix describing parsing `name:beg-end` when name allows colons:
pseudocode description of algorithm to detect ambiguous input, as proposed
in a comment on samtools#124; suggest also accepting an alternative `{name}:beg-end`
delimited notation.

Add previously omitted SQ-AN history note.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Jan 15, 2019
Disallow \ , "`' ()[]{}<> punctuation characters in reference sequence
names. Commas and angle brackets are used to delimit refnames in other
SAM fields (e.g. SA) and in VCF files, and restricting these other
characters facilitates future delimiter and quoting syntax.

Statistics gathered from various reference sequence archives suggest
that these characters appear vanishingly infrequently in refnames in
existing files in the wild.

Fixes the SAM aspects of samtools#124, samtools#167, samtools#258, and samtools#291.

Add appendix describing parsing `name:beg-end` when name allows colons:
pseudocode description of algorithm to detect ambiguous input, as proposed
in a comment on samtools#124; suggest also accepting an alternative `{name}:beg-end`
delimited notation.

Add previously omitted SQ-AN history note.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Feb 5, 2019
Breakend notation always includes a ":pos" part, so breakends are
unambiguous even if the "chr" in "chr:pos" also itself contains colons.

As this is a relaxation of the previous rules, there is no concern
about altering all three 4.1/4.2/4.3 specs.

Fixes the VCF/colon aspects of samtools#124. Fixes samtools#258. Closes samtools#291.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Apr 9, 2019
Breakend notation always includes a ":pos" part, so breakends are
unambiguous even if the "chr" in "chr:pos" also itself contains colons.

As this is a relaxation of the previous rules, there is no concern
about altering all three 4.1/4.2/4.3 specs.

Fixes the VCF/colon aspects of samtools#124. Fixes samtools#258. Closes samtools#291.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue May 29, 2019
Breakend notation always includes a ":pos" part, so breakends are
unambiguous even if the "chr" in "chr:pos" also itself contains colons.

As this is a relaxation of the previous rules, there is no concern
about altering all three 4.1/4.2/4.3 specs.

Fixes the VCF/colon aspects of samtools#124. Fixes samtools#258. Closes samtools#291.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Feb 6, 2020
Breakend notation always includes a ":pos" part, so breakends are
unambiguous even if the "chr" in "chr:pos" also itself contains colons.

As this is a relaxation of the previous rules, there is no concern
about altering all three 4.1/4.2/4.3 specs.

Fixes the VCF/colon aspects of samtools#124. Fixes samtools#258. Closes samtools#291.
@jmmut jmmut closed this as completed in 0ef79c0 Mar 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants