-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Existing VCF spec is incompatible with certain hg38 contig names #258
Comments
So, the issue with the colon (AFAIK) is that people like to be able to write
and that it would be parsed in the logical way...if you are allowed to have colons and hyphens in the contig names...then the parsing fails...obviously. I am not sure I understand how allowing colons formally helps this. If we do immediately encode the colon and hypen (which I would argue also needs protection) using |
@yfarjoun The idea is that tools could be updated to use the sequence dictionary when parsing intervals of the form "contig:start-end". This would resolve the parsing ambiguity while not imposing requirements on the user to escape these contig names manually. |
I don't see how this would happen... I have a dictionary that contains the following two contigs:
and I run GATK as so:
What will happen? |
In that pathological (and extremely hypothetical) case the tool could throw, or handle it in a way that seems best to the tool author. But using sequence dictionary lookups when parsing intervals would solve the actual problem we have with real hg38 contig names. Note that this issue of how tools should do their parsing is somewhat orthogonal to the issue with the VCF spec itself. Since making the SAM spec more restrictive is unlikely to happen, it seems better to make the VCF spec more permissive, and handle the parsing issue in the way described above. |
I know it was hypothetical, and pathological, but if we make a change to the spec, we should define what happens in the corner-cases as well. For example, we could define in the spec that the first contig in the dictionary that matches a certain regex prescribes the parsing. |
Parsing and representation of intervals is outside of the scope of the VCF spec. It is up to each toolkit to decide how to represent and parse intervals, including use of For the VCF spec itself, which is intended to be the focus of this ticket, the issue is the current inconsistency in allowed characters in contig names vs. the SAM spec. Since hg38 contig names are unfortunately here to stay, and since it is almost impossible to make breaking changes to the SAM spec (unless |
The dictionary of contigs is optional in VCF, better if the solution did not depend on it. As mentioned above, including colons in chromosome names is not a problem for the format itself, only for the tools. |
Breakend notation requires the parsing of intervals using the colon separator. It is not outside the scope of the VCF specifications and is a problem with parsing of the VCF files themselves. The ambiguity in VCF (and the other tools) can be resolved by always parsing in favour of contig:position or contig:position-position (parsing in favour of the other direction doesn't work in the general case). For VCF, a breaking change requiring contig:position notation for all breakends would simplify the parsing as it would no longer be ambiguous.
I would strongly urge GATK (and other tools) to interpret this as |
One thing we do is to allow "contig:1+" notation to mean "from position 1
to the end"...this makes it easier to programmatically specify "the entire
contig" without having to look in the dictionary to find what the length
is...
…On Wed, Oct 25, 2017 at 11:59 PM, Daniel Cameron ***@***.***> wrote:
As mentioned above, including colons in chromosome names is not a problem
for the format itself, only for the tools.
Parsing and representation of intervals is outside of the scope of the VCF
spec. It is up to each toolkit to decide how to represent and parse
intervals, including use of : as a delimiter in interval expressions.
Breakend notation requires the parsing of intervals using the colon
separator. It is not outside the scope of the VCF specifications and is a
problem with parsing of the VCF files themselves.
The ambiguity in VCF (and the other tools) can be resolved by always
parsing in favour of contig:position or contig:position-position (parsing
in favour of the other direction doesn't work in the general case). HLA-A*01:01:01:01
would be parsed as HLA-A*01:01:01:1. If a user wants to use the entire
contig of an unfortunately named contig they would have to specify the
entire contig interval. In the case of HLA-A*01:01:01:01, they would have
to refer to it as HLA-A*01:01:01:01:1-3503 for it to be parsed as
intended. It's not difficult for tools to always use the
contig:position/contig:position-position notations as unqualified contig
is just a shorthand for contig:start-end.
chr1:10-20
chr1
java -jar GATK.jar PrintReads -L chr1:10-20 ....
I would strongly urge GATK (and other tools) to interpret this as chr1:
10-20 as the first contig can be referred to as chr1:10-20:1-11.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#258 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ACnk0pxVbm3afWorypQsUif9-V6pRf7eks5sv6ElgaJpZM4QGZcp>
.
|
SAM added the AN field for alternative name. This was largely a way of resolving the chr1 vs Chr1 vs 1 debacle, but actually it also solves the chr1:10-20 vs chr1:10-20 debacle too if we just state that colons are always interpreted as ranges and if you wish to refer to that chromosome name then you need to add an AN field and use that instead. That said, looking it up in the header to see if the range request matches something already in there is a pragmatic workaround. |
@d-cameron I forgot about the breakends, good point. Any lookup mechanism does not address the problem that contigs are optional in VCF. Also does not help when ridiculous contig names are used, as pointed by @yfarjoun. For example imagine two contigs named A simple escaping mechanism is needed. Percent encoding is a possibility. Or using a backslash. I personally favor the latter, as it is much simpler to parse. However, I'd accept the percent encoding as well. |
I agree with @yfarjoun and @pd3 - the "pathological" case of |
I would be strongly against making the BAM reference naming more stringent. Prior to hs38DH that would have been a good idea, but there is now a lot of hs38DH data out in the world and I think that ship has already sailed. It's one thing to have a new version of the BAM spec, it's another altogether to make it such that everyone would either have to a) change the reference they are using or b) change all their tooling to use I agree with the statement above that this is a problem for the tooling, not for the specifications (other than that VCF and BAM should match, and in my opinion by making VCF more lenient). In practice a lot of this can be solved be, e.g. always splitting on the last |
Incorporating it in BAM2 is problematic as it is a change to SAM, not BAM. I would vote to resolve all ambiguities in favour of the position qualified interpretation. SAM doesn't care, VCF can be salvaged by fixing a few edge cases. As far as I'm aware, no tool actually write BND notation record without positional qualification so, in practice, the change to VCF is a bit of a non-issue. It's all in how the toolchains interpret colon-containing contigs and this can be resolved without resorting to any contig encoding or escaping. @pd3 Neither backslash nor percentage encoding are backwards compatible as both are valid contig characters according to the SAM regex. I would very much like to avoid having a third escaping mechanism introduced into the VCF specifications. We're struggling with the two we already have (I'm not aware of any implementations that are fully compliant parsers of the two escaping mechanisms that are already in the specifications). |
@d-cameron You say what you do not want. But what is your proposal then? |
Largely it is a tool chain issue as the tools mix chromosome and region into a single argument. This is historical, but it was luck that we ended up this way. I have seen names like chr10:1000-2000 and that name is chosen because it is the nomenclature for specifying the region, so no matter what nomenclature we used we'd have hit this issue assuming it's a single string. I favour a workaround strategy, which is a common way to describe "all"; eg for reference ref have ref:, ref:1- or ref:1+ to mean base 1 to end. Then if we have reference chr10 as well as chr10:1000-2000 and we wish to refer to the entirety of the latter, we specify chr10:1000-2000: to disambiguate it. The region parsers typically work from last colon, which would then make this unambiguous, until someone adds a name ending in colon in there but we can ban that now while it's not in active use or just accept we'd need double-colon for that case. It's better than escape mechanism because they aren't completely backwards compatible. |
@jkbonfield I hate to point out that this also will not work with insane contig names ending with colon. The only general mechanism is escaping, even if the specification should introduce a new escape character now and break backwards compatibility. |
Oh god, are people already using contig names ending in colon? What's the world: coming to? Escape characters work provided the escape character also isn't already being used anywhere, or if we simply don't care about backwards compatibility, but that's hard to swallow. |
People use all sorts of weird stuff. The most recent fun discovery was the pipe character in the contig name. |
This discussion so far has focused on parsing issues in interval representation. I'd like for a moment to bring it back to the |
I've just rechecked the VCF specs and s5.4 requires "chr:pos" notation so there isn't a BND parsing ambiguity in the specifications after all. Edit: There might be an issue if the specs allow alt alleles of the for |
@d-cameron I see, overlooked this, sorry. I could live with requiring explicit range, fine with me. Still I do favor some sort of escaping more, not to the point of blocking your proposal though. |
As further support for my proposal: I actually had a user encounter the HLA issue with GRIDSS some months back (PapenfussLab/gridss#87). Resolving parsing in favour of contig:position was sufficient for GRIDSS (which uses VCF breakend notation exclusively) to work with hg38 HLA contigs. |
Colons in contig names cause problems for tools' user interfaces, but I think those problems can be overcome satisfactorily with simple heuristics based on the contig name dictionary: see my long comment on #124. What remaining parsing problems are there in parsing VCF format with colons in contig names? The spec says
However as @d-cameron noted the breakend notation is still unambiguous as the Which other VCF/BCF fields would colon-containing contig names need to be escaped in? It seems to me that relaxing the ban on colons wouldn't cause too many problem for the format itself. |
Disallow \ , "`' ()[]{}<> punctuation characters in reference sequence names. Commas and angle brackets are used to delimit refnames in other SAM fields (e.g. SA) and in VCF files, and restricting these other characters facilitates future delimiter and quoting syntax. Statistics gathered from various reference sequence archives suggest that these characters appear vanishingly infrequently in refnames in existing files in the wild. Fixes the SAM aspects of samtools#124, samtools#167, samtools#258, and samtools#291. Add appendix describing parsing `name:beg-end` when name allows colons: pseudocode description of algorithm to detect ambiguous input, as proposed in a comment on samtools#124; suggest also accepting an alternative `{name}:beg-end` delimited notation. Add previously omitted SQ-AN history note.
Disallow \ , "`' ()[]{}<> punctuation characters in reference sequence names. Commas and angle brackets are used to delimit refnames in other SAM fields (e.g. SA) and in VCF files, and restricting these other characters facilitates future delimiter and quoting syntax. Statistics gathered from various reference sequence archives suggest that these characters appear vanishingly infrequently in refnames in existing files in the wild. Fixes the SAM aspects of samtools#124, samtools#167, samtools#258, and samtools#291. Add appendix describing parsing `name:beg-end` when name allows colons: pseudocode description of algorithm to detect ambiguous input, as proposed in a comment on samtools#124; suggest also accepting an alternative `{name}:beg-end` delimited notation. Add previously omitted SQ-AN history note.
Breakend notation always includes a ":pos" part, so breakends are unambiguous even if the "chr" in "chr:pos" also itself contains colons. As this is a relaxation of the previous rules, there is no concern about altering all three 4.1/4.2/4.3 specs. Fixes the VCF/colon aspects of samtools#124. Fixes samtools#258. Closes samtools#291.
Breakend notation always includes a ":pos" part, so breakends are unambiguous even if the "chr" in "chr:pos" also itself contains colons. As this is a relaxation of the previous rules, there is no concern about altering all three 4.1/4.2/4.3 specs. Fixes the VCF/colon aspects of samtools#124. Fixes samtools#258. Closes samtools#291.
Breakend notation always includes a ":pos" part, so breakends are unambiguous even if the "chr" in "chr:pos" also itself contains colons. As this is a relaxation of the previous rules, there is no concern about altering all three 4.1/4.2/4.3 specs. Fixes the VCF/colon aspects of samtools#124. Fixes samtools#258. Closes samtools#291.
Breakend notation always includes a ":pos" part, so breakends are unambiguous even if the "chr" in "chr:pos" also itself contains colons. As this is a relaxation of the previous rules, there is no concern about altering all three 4.1/4.2/4.3 specs. Fixes the VCF/colon aspects of samtools#124. Fixes samtools#258. Closes samtools#291.
In hg38 sequence dictionaries we frequently encounter contigs such as this one:
These contigs include the
:
character in their names, which is problematic for VCF since section1.6.1
of the current spec specifically forbids colons in contig names:This is inconsistent with the SAM spec, which allows colons in contig names (they are required to match the regular expression
\*|[!-()+-<>-~][!-~]*
).With increasing adoption of hg38 (particularly in the last year or so), this is becoming a bit of a crisis. The maintainers of the VCF spec met recently and agreed that this issue needs to be resolved in the next version of the spec (in addition to agreeing on an officially-sanctioned practical workaround that can be used with the current version of the spec).
I see a couple of possible options here:
Remove the ban on colons in contig names in the next version of the spec (VCF 4.4). This would resolve the current incompatibility between the VCF and SAM specs, and since it constitutes a relaxation of an existing restriction rather than imposing a new restriction, it should not cause backwards-compatibility issues.
Keep the restriction, but require that colons in contig names be escaped within VCF records, and automatically encoded/decoded by parsers (rather than escaped manually by users, which would be impractical). This option is consistent with section 1.2 of the existing spec:
I think whatever approach we choose, tools will have to be updated to accept intervals such as "HLA-A*01:01:01:01:1-5", and use the sequence dictionary to resolve the parsing ambiguity introduced by the colons. Requiring users to escape these contig names manually seems far too onerous.
Thoughts on this issue? I'm hoping to trigger a discussion that will lead to a consensus + resolution in time for VCF 4.4, plus agreement on a workaround that can be used within the existing spec.
The text was updated successfully, but these errors were encountered: