[Custom]: "are" custom element names ASCII characters, or MUST they be ASCII characters? #239

Closed
hayatoito opened this issue Jul 6, 2015 · 61 comments

Comments

@hayatoito
Contributor

Title: [Custom]: "are" custom element names ASCII characters, or MUST they be ASCII characters? (bugzilla: 22056)

Migrated from: https://www.w3.org/Bugs/Public/show_bug.cgi?id=22056


comment: 0
comment_url: https://www.w3.org/Bugs/Public/show_bug.cgi?id=22056#c0
Dominic Cooney wrote on 2013-05-16 06:29:42 +0000.

"is a sequence of alphanumeric ASCII characters"

This is confusing. NCName [1] includes combining characters and extenders that are not ASCII characters. These should be allowed, because custom element names MUST match the NCName production and there is no restriction on the character set.

I think "is a sequence of alphanumeric ASCII characters" should be "MUST be a sequence of ASCII characters".

[1] http://www.w3.org/TR/1999/REC-xml-names-19990114/#NT-NCName

@rniwa
Collaborator

rniwa commented Mar 1, 2016

The current specification doesn't even mention ASCII characters anywhere:

https://w3c.github.io/webcomponents/spec/custom/#dfn-custom-element-type

The custom element type identifies a custom element interface and is a sequence of characters that must match the NCName production [XML-NAMES], must contain a U+002D HYPHEN-MINUS character, and must not contain any uppercase ASCII letters [HTML].

I think it makes sense to restrict the tag name to ASCII letters, at least in v1.

@rniwa
Collaborator

rniwa commented Mar 1, 2016

Any opinions? @annevk @travisleithead @hober

@notwaldorf

I'd like to cast my vote for non-ASCII in custom element names! Emoji aside (https://jsbin.com/buzegi/edit?html,output), which is kind of cool, I think Kanji characters in tag names are a real use case 😊

Does the parser care if the tag name is non-ascii? Like, what makes it a hard problem (out of curiosity)?

@rniwa
Collaborator

rniwa commented Mar 1, 2016

See the issue #177. There are disagreements on the exact set of characters allowed in tag names.

Someone needs to investigate the issue and come up with a safe/correct subset of characters that can be used in custom elements.

I'm suggesting that we restrict it to only ASCII characters in v1, since it's always safe to expand the set of allowed characters later once someone has done that work, but not vice versa.

@rniwa
Collaborator

rniwa commented Mar 1, 2016

By the way, if we're allowing exotic non-ASCII characters like emoji, we probably don't need the hyphen requirement in those tag names since the requirement exists for the forward compatibility with future HTML documents, and I don't think we'd ever add an HTML element with an emoji in its tag name.

For example, '-' almost never appears in Chinese/Japanese, and it would look absolutely awful between Hanzi/Kanji/Katakana/Hiragana/etc. Bad: マイ-エレメント. Good: マイエレメント. Alternatively, we should allow a full-width equivalent of the hyphen, such as http://unicode-table.com/en/30FB/.

@rniwa
Collaborator

rniwa commented Mar 1, 2016

Also, if we do allow accented characters, would we allow capital accented letters? E.g., would è be allowed but È be disallowed in tag names? That would be rather confusing.

@annevk
Collaborator

annevk commented Mar 1, 2016

Per the HTML parser, a tag name has to start with [a-zA-Z]. However, once you get to the "tag name state", anything goes, except for ASCII whitespace, "/", ">", and U+0000.

I would be okay with requiring ASCII lowercase (with at least one hyphen) as a start and then go from there. I would also be fine with allowing more, but I don't think we should do anything that requires changing the rule that it starts with an ASCII alpha.
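
As a rough illustration of the parser constraint described in this comment, the check below accepts a name only if it starts with an ASCII letter and contains none of the excluded characters. This is a minimal sketch, not the HTML tokenizer itself; the function name `isParseableTagName` and the constant `EXCLUDED` are hypothetical, and the set of ASCII whitespace (tab, LF, FF, CR, space) is assumed from the HTML specification.

```javascript
// Characters the comment lists as not allowed inside a tag name:
// ASCII whitespace, "/", ">", and U+0000.
const EXCLUDED = new Set(["\t", "\n", "\f", "\r", " ", "/", ">", "\0"]);

// Hypothetical helper: true if the HTML parser could tokenize `name`
// as a tag name under the rule quoted above.
function isParseableTagName(name) {
  // The first code point must be an ASCII letter.
  if (!/^[A-Za-z]/.test(name)) return false;
  // After that, anything goes except the excluded characters.
  for (const ch of name) {
    if (EXCLUDED.has(ch)) return false;
  }
  return true;
}
```

Under this sketch, `my-日本酒` and `x-джэц` pass, while `マイエレメント` fails only because of its first character.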

@annevk annevk changed the title [Custom]: "are" custom element names ASCII characters, or MUST they be ASCII characters? (bugzilla: 22056) [Custom]: "are" custom element names ASCII characters, or MUST they be ASCII characters? Mar 1, 2016
@annevk
Collaborator

annevk commented Mar 1, 2016

(See also whatwg/html#721 about making custom elements support self-closing syntax, just like SVG and MathML.)

@chaals
Contributor

chaals commented Mar 1, 2016

I'm with Anne. Starting with something like x-джэц or my-日本酒 as legal seems reasonable enough, and leaving the HTML parser alone seems to be a Good Idea™ worth trying out in reality before we start messing with it.

@rniwa
Collaborator

rniwa commented Mar 1, 2016

Per the HTML parser, a tag name has to start with [a-zA-Z]. However, once you get to the "tag name state", anything goes, except for ASCII whitespace, "/", ">", and U+0000.

That requirement doesn't exist in the XML parser so I'm inclined to say we should get rid of that requirement in the XML documents because it really doesn't meet the author expectation in non-European languages. This should be an important consideration in the parser extensibility issue #113.

Now, irrespective of HTML or XML documents, it doesn't make any sense to require - in the tag name when the tag name contains non-ASCII letters since there is no conceivable way that would become a forward compatibility problem with the future HTML specifications.

Again, my preference would be to require ASCII lowercase letters for the entire tag name in v1, and extend it carefully in the future, since in practice even authors in Japan, China, etc. are going to use alphanumeric tag names in HTML documents to be consistent with other built-in elements.

Having said all those things, I see two sensible options:

  1. Require that all characters in a custom element tag name be ASCII lowercase.
  2. Define a strict subset of what document.createElement, the HTML parser, and the XML parser support, and then require that a custom element tag name consist of only those letters with a leading ASCII character, with the additional requirement that - be present when the tag name only contains alphanumeric letters.

@annevk
Collaborator

annevk commented Mar 2, 2016

First you say you don't want to constrain XML by the rules of HTML but then you say you want to use a subset of both.

1 coupled with hyphens is definitely the easiest option here.

(XML is constrained by https://www.w3.org/TR/xml-names/#NT-QName whereas createElement() is constrained by https://www.w3.org/TR/xml/#NT-Name. I think the former is a subset of the latter. But XML is also not consistently implemented across engines due to the fifth edition debacle and everyone mostly stopped caring for it.)

@domenic thoughts?

@rniwa
Collaborator

rniwa commented Mar 2, 2016

Well, that's because you said you don't want to remove the leading ASCII letter requirement. I would want to remove that requirement in XML documents if we're allowing non-ASCII letters, but I'd much rather come up with something everyone agrees on than keep debating this.

On that ground, lowercase ASCII letters with hyphens is the easiest one to spec. IMO, we should just go with that and move on. There are too many other important issues to tackle for v1.

@annevk
Collaborator

annevk commented Mar 2, 2016

Oh I think you misunderstood. I simply explained how the HTML parser is constrained and that I don't think we should change the HTML parser. I did not mean to imply that should similarly constrain the local name of custom elements. But I'm happy with the simplest thing that could possibly work.

@rniwa
Collaborator

rniwa commented Mar 2, 2016

Oh I see. Thanks for the clarification. We should just settle on whatever safest subset we can all agree on for v1.

@domenic
Collaborator

domenic commented Mar 2, 2016

I tend to agree with @rniwa that a restriction to ASCII letters in v1 makes sense. On the other hand, I was about to say "we could wait until developers ask for an expanded set and add them in the future", but then I realized @notwaldorf in this thread is a developer doing exactly that. So maybe we should be more permissive.

Given how XML is a mess and I'd probably make document.defineElement just always fail in XML documents if I could, how about the following?

  • If context object is an XML document, validate that it contains a hyphen and only [a-zA-Z0-9]. (Should we disallow uppercase too?)
  • If context object is an HTML document, validate that:

@annevk
Collaborator

annevk commented Mar 2, 2016

We should disallow uppercase in XML. If we want to allow more in HTML, we should use QName from xml-names per createElementNS() since Name allows things that cannot appear in browser-implemented XML. A code point that is not [a-z-] should be enough I think to make it "I am custom".

@domenic
Collaborator

domenic commented Mar 2, 2016

If we want to allow more in HTML, we should use QName from xml-names per createElementNS() since Name allows things that cannot appear in browser-implemented XML.

I don't quite follow this reasoning. Why does stuff about browser-implemented XML impact what we do in HTML documents?

A code point that is not [a-z-] should be enough I think to make it "I am custom".

So <form2> is custom? That's kind of neat.
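
One literal reading of the "not [a-z-]" idea can be sketched as follows: a name signals "I am custom" if it contains a hyphen (the existing rule) or any code point outside a-z and the hyphen. This is only an illustration of the suggestion as quoted, not a specified algorithm, and the function name `signalsCustom` is hypothetical.

```javascript
// Hypothetical helper: does `name` carry a marker that distinguishes it
// from a plain lowercase-ASCII element name?
function signalsCustom(name) {
  // Existing rule: a hyphen marks the name as custom.
  const hasHyphen = name.includes("-");
  // Suggested extension: any code point outside [a-z-] also marks it.
  const hasUnusualCodePoint = /[^a-z-]/u.test(name);
  return hasHyphen || hasUnusualCodePoint;
}
```

With this reading, `form2` counts as custom and `form` does not. The same reading would also treat built-in names that contain digits, such as `h1`, as custom, so a real rule would presumably need to account for them.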

In any case, you seem to have the best grasp on the restrictions here. With the guiding principles of:

  • We don't care how restrictive we are in XML; restrict as much as you want
  • We do want to allow as much as possible in HTML
  • In HTML, we don't want to require the hyphen if we're using "unusual enough" characters so that we know it's a custom element anyway

would you mind taking over the writing of the exact algorithm? Maybe even do it as a PR after #405 lands.

@rniwa
Collaborator

rniwa commented Mar 2, 2016

No, if we're allowing non-ASCII characters, I want to remove the restriction that the leading letter must be an ASCII lowercase letter in XML documents, because it just doesn't work well in languages that don't use the Latin alphabet.

@domenic
Collaborator

domenic commented Mar 2, 2016

@rniwa why do you care about XML documents?

@rniwa
Collaborator

rniwa commented Mar 2, 2016

@domenic: I don't care whether I write HTML documents or XML documents. But, as an author, I would rather use XML documents to get around the annoyance that the leading letter must be an ASCII lowercase letter, in Japanese for example. It just doesn't meet author expectations.

@domenic
Collaborator

domenic commented Mar 2, 2016

I'm confused. Why don't you just use HTML documents? That restriction doesn't exist there.

@rniwa
Collaborator

rniwa commented Mar 2, 2016

@domenic: It totally does. The HTML5 parser requires the leading letter of every tag name to be ASCII, and such is not the case in the XML parser.

@domenic
Collaborator

domenic commented Mar 2, 2016

Ah I see, sorry, I was looking at DOM instead of HTML. That position makes sense... but I assume that restriction is in the parser for a good reason. Probably to deal with things like <! and <[space] and <% and <?. Maybe we can allow [A-Za-z] plus anything greater than U+007F. (Probably with more random small subsets excluded per Name or QName.)
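
The leading-character idea floated here ([A-Za-z] plus anything greater than U+007F) could be sketched like this. It covers only the first code point and ignores the further exclusions from Name or QName that the comment mentions; the function name `hasAllowedLeadingCodePoint` is hypothetical.

```javascript
// Hypothetical helper: checks only the FIRST code point of `name`.
function hasAllowedLeadingCodePoint(name) {
  const cp = name.codePointAt(0);
  if (cp === undefined) return false; // empty string
  const isAsciiAlpha =
    (cp >= 0x41 && cp <= 0x5a) || (cp >= 0x61 && cp <= 0x7a);
  // Everything in the ASCII range other than letters stays excluded,
  // which keeps "<!", "<?", "<%" and "< " out of tag-name territory.
  return isAsciiAlpha || cp > 0x7f;
}
```

Here `マイエレメント` is accepted, while strings beginning with `!`, `?`, `%` or a space are rejected.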

@annevk
Collaborator

annevk commented Mar 3, 2016

sigh

I'm not sure I want to work on this. There are five sets of names, as far as I can tell, of which three are used (with two of them arguably wrong):

  1. HTML parser names. [a-z] for the first letter followed by pretty much anything.
  2. xml 4th edition Name. I think most browsers use this for createElement().
  3. xml 5th edition Name. Technically what browsers should use for createElement(), but don't. This is what allows emojis.
  4. xml-names 2nd edition NCName. Used for createElementNS() and elements in the XML parser.
  5. xml-names 3rd edition NCName. Should be used for createElementNS() and elements in the XML parser.

I see two sane approaches here:

  1. We restrict custom elements to ASCII alpha + ASCII hyphen.
  2. We follow the restrictions from createElement() and createElementNS(), while ignoring that those are different from each other, from the HTML parser, and from what should be implemented for them per the latest XML specifications. (This requires no restriction to be specified and defers this mess to be cleaned up by the next generation, likely still us.)

@annevk
Collaborator

annevk commented Mar 3, 2016

There was talk at some point of trying to see if we could lift constraints on names altogether, but I don't think that ever happened. @foolip was the last to touch that potato.

@rniwa
Collaborator

rniwa commented Mar 3, 2016

Option 2 seems rather risky because we could end up allowing names that can't be processed by HTML/XML parser and we may not even know about it. So I think we should go with option 1 for now. It's easy to expand the set of letters we can use later.

@foolip
Member

foolip commented Mar 3, 2016

The previous discussion was in a "Mismatch between HTML parser and createElement() et al" thread on blink-dev, spawned from an "Inconsistency in characters allowed in attribute names between setAttribute and HTML syntax specs" spec bug.

ASCII alpha + ASCII hyphen seems like the safer option, really.

@rniwa
Collaborator

rniwa commented Mar 3, 2016

We should probably file an issue in HTML and figure out "the one definition". I'm more than happy to use this definition once it's ready (even in v1) but I don't want to hold up the custom elements API on that.

@annevk
Collaborator

annevk commented Mar 3, 2016

"One or more a-z (lowercase), followed by a hyphen, followed by zero or more a-z (lowercase) or hyphen."
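
The quoted rule translates almost directly into a regular expression. The sketch below is one such translation; the names `ASCII_CUSTOM_NAME` and `isAsciiCustomName` are hypothetical.

```javascript
// One or more a-z, then a hyphen, then zero or more a-z or hyphen.
const ASCII_CUSTOM_NAME = /^[a-z]+-[a-z-]*$/;

// Hypothetical helper wrapping the pattern.
function isAsciiCustomName(name) {
  return ASCII_CUSTOM_NAME.test(name);
}
```

Note that, read literally, the rule accepts a trailing hyphen (`x-`) and rejects digits (`my-element2`).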

@rniwa
Collaborator

rniwa commented Mar 3, 2016

Oh, I meant the one definition that includes non-ASCII letters, if that is even possible.

@annevk
Collaborator

annevk commented Mar 3, 2016

I see, it really depends on what the requirements are. Does it need to be supported by the HTML parser? Does it need to be supported by createElement()? Does it need to be supported by createElementNS()? Do we want emojis? If the answer to the first three is yes, you could have NCName, plus the limitation that the first code point is an ASCII alpha, plus that it must contain a hyphen. If you also want emojis, it might be good for browsers to implement the latest version of XML, etc.
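
The "yes to the first three" combination (NCName, plus an ASCII alpha first code point, plus a hyphen) might look like the sketch below. It assumes the NameChar ranges from the XML 1.0 Fifth Edition with the colon removed, which is the edition that admits emoji; browsers that implement the Fourth Edition would accept a narrower set. The names `NAME_CHAR`, `CANDIDATE_NAME` and `isCandidateName` are hypothetical.

```javascript
// NameChar from XML 1.0 (Fifth Edition), minus ":" as NCName requires.
// Assumption: the Fifth Edition ranges, not the Fourth Edition tables.
const NAME_CHAR = String.raw`[-.0-9A-Z_a-z\u{B7}\u{C0}-\u{D6}\u{D8}-\u{F6}` +
  String.raw`\u{F8}-\u{37D}\u{37F}-\u{1FFF}\u{200C}\u{200D}\u{203F}\u{2040}` +
  String.raw`\u{2070}-\u{218F}\u{2C00}-\u{2FEF}\u{3001}-\u{D7FF}` +
  String.raw`\u{F900}-\u{FDCF}\u{FDF0}-\u{FFFD}\u{10000}-\u{EFFFF}]`;

// First code point is an ASCII alpha; the rest are NCName characters.
const CANDIDATE_NAME = new RegExp(`^[A-Za-z]${NAME_CHAR}*$`, "u");

// Hypothetical helper: NCName + ASCII alpha start + at least one hyphen.
function isCandidateName(name) {
  return CANDIDATE_NAME.test(name) && name.includes("-");
}
```

Under these assumptions `my-日本酒` and `x-джэц` qualify, and `マイ-エレメント` does not because of its first character.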

@rniwa
Collaborator

rniwa commented Mar 3, 2016

Well, I'm saying that we should probably figure this out for document.createElement and the HTML/XML parsers first (that is, define what a valid name is for all cases in a single definition) before deciding what to do for custom elements. That's precisely why I've been suggesting to use ASCII lowercase + hyphen for now.

@annevk
Collaborator

annevk commented Mar 3, 2016

@rniwa oh, I don't think we can change HTML to allow non-a-z at the start. That would change the parsing of <†> and similar constructs. And we cannot change XML to require a-z at the start. We can have a common subset for custom elements, but we cannot have a common rule for all of them unless we start breaking things.

@rniwa
Collaborator

rniwa commented Mar 3, 2016

I think you're still misunderstanding me. What I'm saying is that there should be one definition in one spec which defines what a valid name means for HTML documents, which may refer to the XML spec, and defines a set of valid names for the HTML parser, the XML parser, createDocument in HTML documents, and createDocument in XML documents. Hopefully there aren't many discrepancies between them but, as you noted, they can't all be the same.

Now, if there is a known, definitely safe subset of all those four potentially distinct sets that we can use for custom elements, then I'm all for it. But it sounded like there isn't, or they aren't even well defined yet. So it seems that we need to do the exercise of determining those four sets first before expanding the set of valid names allowed in custom elements.

@domenic
Collaborator

domenic commented Mar 8, 2016

I went with a liberal-as-possible intersection set in 35086b3. See https://w3c.github.io/webcomponents/spec/custom/#valid-custom-element-name for the rendered output.

We can work toward centralizing all definitions into one place (presumably DOM) later, and I think they will indeed all be distinct, but the definitions are already out there. I guess either DOM or browsers have a bug since DOM specifies XML 5th edition and browsers use XML 4th edition for createElement(NS). But for now custom elements will just use XML 5th edition like DOM does, and if we want to change both at once to align with browser reality (instead of making browsers more liberal) we can definitely do so.

@annevk
Collaborator

annevk commented Mar 8, 2016

Would it not be easier to say it needs to match NCName plus these other restrictions? I'm not sure introducing a whole new production is helpful here.

@domenic
Collaborator

domenic commented Mar 8, 2016

Hmm, I thought a production would be much easier to read/code against than taking a production and then using prose. The other restrictions get pretty hairy, to the extent that the new production is not really recognizable as an NCName.

@annevk
Collaborator

annevk commented Mar 9, 2016

The main thing is that browsers have code for an "NCName" check and everyone is vaguely familiar with it given createElementNS (and it's only a character different from createElement). So placing additional requirements beyond "NCName" makes it

  1. Easier to see the delta.
  2. Likely easier to implement.

@domenic
Collaborator

domenic commented Mar 9, 2016

I've added a non-normative note that should make it easier to see the delta. Hope that's clearer.

@annevk
Collaborator

annevk commented Mar 9, 2016

Thanks, that helps.

@trusktr
Contributor

trusktr commented Apr 15, 2016

By the way, if we're allowing exotic non-ASCII characters like emoji, we probably don't need the hyphen requirement in those tag names since the requirement exists for the forward compatibility with future HTML documents, and I don't think we'd ever add an HTML element with an emoji in its tag name.

Why are we so worried about custom element names conflicting with possible future tag names? (Not you specifically, @rniwa.)

I propose that we should be allowed to override any element we wish, and on a per-shadow-root basis (not just on document):

// file1.js
import AwesomeImageElement from 'awesome-img'

const el = document.querySelector('#someEl')
const root = el.createShadowRoot()
root.registerElement('img', AwesomeImageElement)
const img = root.createElement('img') // creates an AwesomeImageElement instance
root.appendChild(img)

// file2.js
const el = document.querySelector('#otherEl')
const root = el.createShadowRoot()
const img = root.createElement('img') // creates an HTMLImageElement instance
root.appendChild(img)

If we allow overriding of native elements, then there will be no problem introducing native elements in the future; existing apps will continue to work, having their custom elements in place. It will also give developers more freedom and flexibility.

Please see the following threads for more details and examples:

@chaals
Contributor

chaals commented Apr 15, 2016

@trusktr this comment is not really relevant to this issue. Having already raised the issue in question, please keep the technical discussion there, and avoid filling other issues with repeats of that information.

Places like twitter, blogs, and public discussions of ideas are other relevant places to look for support or discussion of your proposal. Filling up issue discussion isn't.

(If it were more relevant, a simple pointer would be enough. In this case, even that would probably be spammy).

Chaals (as chair)

@trusktr
Contributor

trusktr commented Apr 15, 2016

Hello @chaals, thanks for the tip!

karlhorky added a commit to karlhorky/vscode-markdown-tm-grammar that referenced this issue Aug 18, 2019
Instead of requiring a name to match an existing HTML element, this relaxes the
restrictions to:

- starting with [a-zA-Z] (matching the HTML parser WICG/webcomponents#239 (comment))
- then continuing with anything other than a space, forward slash or closing
  angle bracket

This is similar to the fix to the following issue in the HTML syntax
highlighting repo (and actually depends on the "derivative" syntax that was
created for that issue):

textmate/html.tmbundle#92