Resolving ambiguities introduced when supporting colons in hg38 names #291
Comments
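Unfortunately the SAM specification allows pretty much anything in a reference name. The first character is not allowed to be * or =. The regex for the rest is /[!-~]*/, which includes all printable non-whitespace characters in US-ASCII. |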
Perhaps we could limit the namespace for contig names in a way that
doesn't interfere with current usage and would allow for this suggestion.
On Thu, Mar 1, 2018 at 7:30 AM, daviesrob wrote:
> Unfortunately the SAM specification
> <http://samtools.github.io/hts-specs/SAMv1.pdf> allows pretty much
> anything in a reference name. The first character is not allowed to be *
> or =. The regex for the rest is /[!-~]*/ which includes all printable
> non-whitespace characters in US-ASCII.
|
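The rule quoted above can be checked mechanically. The following is a minimal sketch, assuming only the rule as stated in this thread (first character anything printable except `*` and `=`, remainder matching `[!-~]*`); the names `REFNAME_RE` and `is_valid_refname` are hypothetical and not taken from any existing tool.

```python
import re

# First character: printable non-space ASCII except '*' (0x2A) and '=' (0x3D).
# Remaining characters: anything from '!' (0x21) through '~' (0x7E).
REFNAME_RE = re.compile(r"[!-)+-<>-~][!-~]*")


def is_valid_refname(name: str) -> bool:
    """Return True if name satisfies the reference-name rule quoted in this thread."""
    return REFNAME_RE.fullmatch(name) is not None
```

Under this rule, names containing colons, commas, or asterisks after the first character are all accepted, which is why the region and breakend ambiguities discussed here arise.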
As I also added to #193, comma is already a problem for us in the SAM world because of SA tags and the new OA tag. Given that it's currently unused in all the legal fai files we found, I'd be in favour of simply banning comma from contig names so we can get some sanity back in the world. Edit: could others please also scan their local reference files for meta-characters? It's possible comma has never been used here but is in active use somewhere else. I wonder if it's possible to trawl the EBI and NCBI archives for SAM and VCF headers? |
I should mention my previous table of punctuation characters found in reference files:
So comma isn't too bad (all the commas came from a single badly-formatted file), but dot is not a good choice. Semicolon looks fairly safe at the moment. It would be a very good idea if everyone with access to large collections of reference sequences did a similar analysis. We can compare results and see if these are typical or not. |
Here are our results:
I used the following:

    while read -r file; do
        # keep only the SN: name of each @SQ line, drop alphanumerics, one remaining character per line
        grep '^@SQ' "$file" | cut -f2 | sed 's/SN://; s/[A-Za-z0-9]*//g; s/\(.\)/\1\n/g'
    done < <(find /references | grep dict) | sort | uniq -c > operators_in_references
|
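For anyone repeating this scan without GNU sed, the same tally can be approximated in Python. This is a sketch under assumptions: `count_punctuation` is a hypothetical helper, it takes header lines from any source, and unlike the shell pipeline it looks for the `SN:` field anywhere on the `@SQ` line rather than assuming it is the second column.

```python
from collections import Counter
from typing import Iterable


def count_punctuation(header_lines: Iterable[str]) -> Counter:
    """Tally every non-alphanumeric character found in @SQ SN: names."""
    counts: Counter = Counter()
    for line in header_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "@SQ":
            continue  # only sequence dictionary lines carry reference names
        for field in fields[1:]:
            if field.startswith("SN:"):
                name = field[3:]
                # count everything that is not an ASCII letter or digit
                counts.update(
                    ch for ch in name if not (ch.isascii() and ch.isalnum())
                )
    return counts
```

Feeding it the lines of each `.dict` file and summing the resulting counters should give a table comparable to the ones posted in this thread.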
Thanks Yossi. Good command line hint too. Any clue where the commas come from? We saw some, but they were in a totally borked fasta file produced by someone's file parsing going wrong. As it stands now we know comma will already produce problems with SAM due to SA tags. Still no semicolon, although I'd rather not introduce a different list separator. |
@jmarshall @jkbonfield can anyone perform the same exercise on the CRAM reference registry? Or the NCBI reference sequence database? Could be fun. |
Not us - that's an EBI internal thing. Also the CRAM reference registry is accessed by md5 only and has no names associated with the sequences. However EBI have many other databases of sequence data, so maybe they can screen those. |
We could disallow (:|-)[0-9]* at the end of contig names. This might not break current references, so we might be able to get it into the SAM spec. It would allow unambiguous parsing of intervals. A similar trick might remove the ambiguity regarding breakpoint... |
That works, although I think it bumps the level of parser required a bit higher - it now needs backtracking by potentially many characters during tokenisation of a region. Pragmatically that may not matter if we're just coding it ad-lib rather than using a formal grammar. |
@yfarjoun My regex skills may be a bit rusty, but wouldn't that make hg38 contig names illegal again? The example initially provided in #258 is:
|
hmmm. yes. Though we could capitalize on the '0' in 01 and disallow the ending to be of the form |
The offending GRCh38 decoy file also has "HLA-DRB1*12:17", which breaks this idea. Basically it's bust whatever we do, unless we simply have code in the command line tools that looks first for a contig matching the full string and, if none is found, then attempts to process the string as a region and repeats the lookup. |
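The "full string first, then region" lookup described above might be sketched as follows. This is an illustration only, not the behaviour of any existing tool: `resolve_region` and the `contigs` set are hypothetical, and coordinates are assumed to be plain digits in `name:beg` or `name:beg-end` form.

```python
import re
from typing import Optional, Set, Tuple

Region = Tuple[str, Optional[int], Optional[int]]

# Greedy name, so the split happens at the last colon followed by digits.
_REGION_RE = re.compile(r"(?P<name>.+):(?P<beg>[0-9]+)(?:-(?P<end>[0-9]+))?")


def resolve_region(text: str, contigs: Set[str]) -> Region:
    """Prefer a contig matching the whole string; otherwise parse name:beg[-end]."""
    # Step 1: the whole string names a known contig.
    if text in contigs:
        return (text, None, None)
    # Step 2: treat the trailing ':beg' or ':beg-end' as coordinates and retry.
    m = _REGION_RE.fullmatch(text)
    if m is not None and m.group("name") in contigs:
        end = m.group("end")
        return (m.group("name"), int(m.group("beg")), int(end) if end else None)
    raise ValueError(f"no contig matches region {text!r}")
```

With `contigs = {"chr1", "HLA-DRB1*12:17"}`, the string `HLA-DRB1*12:17` resolves to the whole contig, while `HLA-DRB1*12:17:5-10` resolves to positions 5 to 10 on it. Note that this approach silently picks the full-name reading when both readings are possible.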
These are the counts from all genome assembly sequences (contigs, scaffold, chromosomes) submitted to ENA since the current assembly submission pipeline was introduced in 2013:
|
Just for clarity: what is the ambiguity that the proposal introduces? If we force the spec to resolve in favor of I was under the impression that the discussions regarding contig:position-position were about the tooling more generally (e.g. specifying intervals in various tools) thus technically out of scope for the VCF specifications themselves. |
Disallow \ , "`' ()[]{}<> punctuation characters in reference sequence names. Commas and angle brackets are used to delimit refnames in other SAM fields (e.g. SA) and in VCF files, and restricting these other characters facilitates future delimiter and quoting syntax. Statistics gathered from various reference sequence archives suggest that these characters appear vanishingly infrequently in refnames in existing files in the wild. Fixes the SAM aspects of samtools#124, samtools#167, samtools#258, and samtools#291. Add appendix describing parsing `name:beg-end` when name allows colons: pseudocode description of algorithm to detect ambiguous input, as proposed in a comment on samtools#124; suggest also accepting an alternative `{name}:beg-end` delimited notation. Add previously omitted SQ-AN history note.
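The commit message above mentions an algorithm for detecting ambiguous `name:beg-end` input and an alternative `{name}:beg-end` notation. The sketch below is one possible reading of that idea, not the appendix's actual pseudocode: it collects every interpretation that matches a known contig and rejects the input when there is more than one. All function names are hypothetical, and it assumes braces cannot occur inside names, as the commit proposes.

```python
import re
from typing import List, Optional, Set, Tuple

Region = Tuple[str, Optional[int], Optional[int]]

_RANGE_RE = re.compile(r":(?P<beg>[0-9]+)(?:-(?P<end>[0-9]+))?")


def _with_range(name: str, suffix: str) -> Optional[Region]:
    """Parse a ':beg' or ':beg-end' suffix; return None if it is malformed."""
    m = _RANGE_RE.fullmatch(suffix)
    if m is None:
        return None
    end = m.group("end")
    return (name, int(m.group("beg")), int(end) if end is not None else None)


def interpretations(text: str, contigs: Set[str]) -> List[Region]:
    """List every reading of text that refers to a known contig."""
    if text.startswith("{"):
        # Delimited form: the name is exactly what sits between the braces.
        close = text.find("}")
        if close < 0 or text[1:close] not in contigs:
            return []
        name, suffix = text[1:close], text[close + 1:]
        if suffix == "":
            return [(name, None, None)]
        region = _with_range(name, suffix)
        return [region] if region else []
    found: List[Region] = []
    if text in contigs:  # reading 1: the whole string is a name
        found.append((text, None, None))
    cut = text.rfind(":")  # reading 2: coordinates follow the last colon
    if cut > 0 and text[:cut] in contigs:
        region = _with_range(text[:cut], text[cut:])
        if region:
            found.append(region)
    return found


def parse_region(text: str, contigs: Set[str]) -> Region:
    """Return the single valid reading, or raise if there are zero or several."""
    found = interpretations(text, contigs)
    if len(found) != 1:
        kind = "ambiguous" if found else "unknown"
        raise ValueError(f"{kind} region: {text!r}")
    return found[0]
```

For example, if a header contains both `chr1` and a contig literally named `chr1:100-200`, the input `chr1:100-200` is reported as ambiguous, while `{chr1}:100-200` and `{chr1:100-200}` each select one reading explicitly.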
Breakend notation always includes a ":pos" part, so breakends are unambiguous even if the "chr" in "chr:pos" also itself contains colons. As this is a relaxation of the previous rules, there is no concern about altering all three 4.1/4.2/4.3 specs. Fixes the VCF/colon aspects of samtools#124. Fixes samtools#258. Closes samtools#291.
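Because the ":pos" part is always present, the mate location in a breakend ALT can be recovered by splitting at the last colon inside the brackets. The following is a rough sketch of that observation, assuming the bracketed `t[p[`, `t]p]`, `]p]t`, `[p[t` forms with a plain contig name (no angle-bracketed contig IDs); `parse_breakend_mate` is a hypothetical helper name.

```python
import re
from typing import Tuple

# Bases, an opening bracket, 'chrom:pos', the same bracket again, bases.
# The greedy chrom group means the split falls at the last colon before the digits.
_BND_RE = re.compile(
    r"(?P<pre>[^\[\]]*)(?P<br>[\[\]])(?P<chrom>.+):(?P<pos>[0-9]+)(?P=br)(?P<post>[^\[\]]*)"
)


def parse_breakend_mate(alt: str) -> Tuple[str, int]:
    """Extract (chrom, pos) of the mate from a breakend ALT string."""
    m = _BND_RE.fullmatch(alt)
    if m is None:
        raise ValueError(f"not a breakend ALT: {alt!r}")
    return m.group("chrom"), int(m.group("pos"))
```

For instance, `G]HLA-DRB1*12:17:198982]` yields the contig `HLA-DRB1*12:17` and position 198982, even though the contig name itself contains a colon.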
I have re-read the whole hg38 contig name pull request thread #258 and wanted to suggest some related, longer term ideas without polluting it.
In that thread, most people are in favor of supporting colons in contig names, and making the parsers resolve breakend ambiguities in favor of contig:position or contig:position-position.
This is a legitimate use case, but I am concerned because we have been trying to reduce the ambiguities in the spec, and this change will introduce a new one. Would people be open to slightly modifying the breakend notation in future versions of the spec, for instance by replacing the colon with a dot or comma? I think these characters are not allowed in contig names according to the SAM spec, but could someone please confirm this?
The representation would change from something like:
To something like the following: