Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New Restrictions on Contig Names not Strict Enough #167

Closed
jamesemery opened this issue Sep 2, 2016 · 3 comments · Fixed by #379
Closed

New Restrictions on Contig Names not Strict Enough #167

jamesemery opened this issue Sep 2, 2016 · 3 comments · Fixed by #379
Labels

Comments

@jamesemery
Copy link

I see that in section 1.4.7 of VCFv4.3, it states that contig names can mostly follow restrictions placed upon them by the SAM spec except for a few explicitly disallowed characters. It would seem that the restricted characters are not enough to preserve parsing of VCF files. Specifically, the SAM format restrictions specify (”[!-)+-<>-][!-]”) which translates into any printable character so long as it isn't in the first place. We place some restrictions on this, namely specifying that the characters ”<>[]:” are additionally non-valid. Since the contig name has to appear in a header line this runs the risk of making it impossible to parse a header string. An example contig header line might be:

##contig=<ID=ctg1,length=81195210,Description="Foo">

Unfortunately both the = and the , characters are legal according to the new restrictions so technically the string "ctg1,length=81195210" COULD be a valid contig name, which means that parsing is ambiguous. Something needs to be done about these two characters at least, the way I see it there are a couple options:

  1. Restrict the contig name to disallow = and , would fix this problem at the expense of further restricting contigs allowed by the SAM spec.
  2. Escape the ID field in the contig header line somehow, for example with a second set of <> or "" characters making the header line look something like this:
    ##contig=<ID=<ctg1>,length=81195210,Description="Foo"> which would break forward compatibility and add some complexity into parsing.
  3. Encode dangerous characters in the header somehow, perhaps by percent encoding of the header contig ID field. The VCFv4.3 spec only seems to explicitly specify percent encoding for INFO and FORMAT field values. Perhaps it could be expanded/clarified to also apply to header ID line values (except for INFO and FORMAT where a regex is already imposed)?
@yfarjoun
Copy link
Contributor

yfarjoun commented Sep 2, 2016

For what it's worth, I support the first option. and I would even strengthen that to support disallowing other "annoying" characters (annoying for command-line parsing purposes) such as:

"`:*()?|\/!&

In fact, I would support disallowing all non-alphanumerics except for a select few:

-_+

I know that this is incompatible with SAM, and we should find a way to deal with that.

@tfenne
Copy link
Member

tfenne commented Sep 2, 2016

I think that before making anything more restrictive it's worth doing a survey of what's actually being used. E.g. disallowing anything that's in the broadly used hs38DH (including : and *) would cause a lot of people a lot of unnecessary pain.

@yfarjoun
Copy link
Contributor

yfarjoun commented Sep 2, 2016

I am, of-course well aware of the : and * in hs38DH...and would love for there to be a solution that would not cause a lot of pain....I guess I was venting some frustration about that.

jmarshall added a commit to jmarshall/hts-specs that referenced this issue Jan 11, 2019
Disallow \ , "`' ()[]{}<> punctuation characters in reference sequence
names. Commas and angle brackets are used to delimit refnames in other
SAM fields (e.g. SA) and in VCF files, and restricting these other
characters facilitates future delimiter and quoting syntax.

Statistics gathered from various reference sequence archives suggest
that these characters appear vanishingly infrequently in refnames in
existing files in the wild.

Fixes the SAM aspects of samtools#124, samtools#167, samtools#258, and samtools#291.

Add appendix describing parsing `name:beg-end` when name allows colons:
pseudocode description of algorithm to detect ambiguous input, as proposed
in a comment on samtools#124; suggest also accepting an alternative `{name}:beg-end`
delimited notation.

Add previously omitted SQ-AN history note.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Jan 15, 2019
Disallow \ , "`' ()[]{}<> punctuation characters in reference sequence
names. Commas and angle brackets are used to delimit refnames in other
SAM fields (e.g. SA) and in VCF files, and restricting these other
characters facilitates future delimiter and quoting syntax.

Statistics gathered from various reference sequence archives suggest
that these characters appear vanishingly infrequently in refnames in
existing files in the wild.

Fixes the SAM aspects of samtools#124, samtools#167, samtools#258, and samtools#291.

Add appendix describing parsing `name:beg-end` when name allows colons:
pseudocode description of algorithm to detect ambiguous input, as proposed
in a comment on samtools#124; suggest also accepting an alternative `{name}:beg-end`
delimited notation.

Add previously omitted SQ-AN history note.
@jmarshall jmarshall added the vcf label Feb 4, 2019
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Feb 5, 2019
Disallow \ , "`' (){} punctuation characters in VCF contig IDs.
The characters []<> were already disallowed in VCF; this also relaxes
the prohibition of * to merely disallowing initial *.

Statistics gathered from various reference sequence archives suggest
that the characters restricted appear vanishingly infrequently in SAM
reference sequence names in existing files in the wild. To the extent
that all contig IDs in VCF files come from corresponding SAM/BAM files,
this means there is little concern about making the same restrictions
in VCF contig IDs.

Fixes samtools#124 and fixes samtools#167 for VCF; their SAM aspects were previously
fixed by PR samtools#333.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Apr 9, 2019
Disallow \ , "`' (){} punctuation characters in VCF contig IDs.
The characters []<> were already disallowed in VCF; this also relaxes
the prohibition of * to merely disallowing initial *.

Statistics gathered from various reference sequence archives suggest
that the characters restricted appear vanishingly infrequently in SAM
reference sequence names in existing files in the wild. To the extent
that all contig IDs in VCF files come from corresponding SAM/BAM files,
this means there is little concern about making the same restrictions
in VCF contig IDs.

Fixes samtools#124 and fixes samtools#167 for VCF; their SAM aspects were previously
fixed by PR samtools#333.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue May 29, 2019
Disallow \ , "`' (){} punctuation characters in VCF contig IDs.
The characters []<> were already disallowed in VCF; this also relaxes
the prohibition of * to merely disallowing initial *.

Statistics gathered from various reference sequence archives suggest
that the characters restricted appear vanishingly infrequently in SAM
reference sequence names in existing files in the wild. To the extent
that all contig IDs in VCF files come from corresponding SAM/BAM files,
this means there is little concern about making the same restrictions
in VCF contig IDs.

Fixes samtools#124 and fixes samtools#167 for VCF; their SAM aspects were previously
fixed by PR samtools#333.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue Feb 6, 2020
Disallow \ , "`' (){} punctuation characters in VCF contig IDs.
The characters []<> were already disallowed in VCF; this also relaxes
the prohibition of * to merely disallowing initial *.

Statistics gathered from various reference sequence archives suggest
that the characters restricted appear vanishingly infrequently in SAM
reference sequence names in existing files in the wild. To the extent
that all contig IDs in VCF files come from corresponding SAM/BAM files,
this means there is little concern about making the same restrictions
in VCF contig IDs.

Fixes samtools#124 and fixes samtools#167 for VCF; their SAM aspects were previously
fixed by PR samtools#333.
@jmmut jmmut closed this as completed in #379 Mar 3, 2020
jmmut pushed a commit that referenced this issue Mar 3, 2020
…llow colons) (#379)

* Allow colons in VCF Contig IDs: breakend notation is unambiguous

Breakend notation always includes a ":pos" part, so breakends are
unambiguous even if the "chr" in "chr:pos" also itself contains colons.

As this is a relaxation of the previous rules, there is no concern
about altering all three 4.1/4.2/4.3 specs.

Fixes the VCF/colon aspects of #124. Fixes #258. Closes #291.

* Restrict allowed VCF Contig ID chars to those allowed in SAM RNAMEs

Disallow \ , "`' (){} punctuation characters in VCF contig IDs.
The characters []<> were already disallowed in VCF; this also relaxes
the prohibition of * to merely disallowing initial *.

Statistics gathered from various reference sequence archives suggest
that the characters restricted appear vanishingly infrequently in SAM
reference sequence names in existing files in the wild. To the extent
that all contig IDs in VCF files come from corresponding SAM/BAM files,
this means there is little concern about making the same restrictions
in VCF contig IDs.

Fixes #124 and fixes #167 for VCF; their SAM aspects were previously
fixed by PR #333.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants