Add toUSVString / toWellFormed (alongside isUSVString / isWellFormed) #13
Comments
A possible enhancement, allowing the replacement to be user-determined:

    function toWellFormed(string, replacement = '\uFFFD') {
      return string.replaceAll(/\p{Surrogate}/gu, replacement);
    }
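For illustration, here are a few calls against the function above. The sample inputs are made up and not from the thread. With the `u` flag the regular expression treats a valid surrogate pair as a single code point, so only lone surrogates should be replaced.

```javascript
// Userland sketch repeated from the comment above so this block is self-contained.
function toWellFormed(string, replacement = '\uFFFD') {
  return string.replaceAll(/\p{Surrogate}/gu, replacement);
}

// Hypothetical sample inputs.
const lone = 'ab\uD800cd';          // contains a lone high surrogate
const paired = 'ab\uD83D\uDE00cd';  // contains a valid pair (U+1F600)

console.log(toWellFormed(lone) === 'ab\uFFFDcd');  // default replacement U+FFFD
console.log(toWellFormed(lone, '?') === 'ab?cd');  // caller-chosen replacement
console.log(toWellFormed(paired) === paired);      // valid pair left untouched
```

One caveat with this shape: `replaceAll` interprets `$` sequences such as `$&` in a string replacement, so a caller-supplied replacement containing `$` may not be inserted literally. Passing a function (`() => replacement`) would sidestep that.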
I can envision use cases for both

Note that the hypothetical sub-linear time performance benefit applies to If we assume that all

That is the assumption I was making. I'm curious to know why it's not safe to make that assumption.

There are many places in the web platform where we make a string well-formed before further processing it. (E.g., input to the URL parser.) Those code paths do not have separate branches for "is well-formed". I would assume userland to have similar cases.
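As a concrete instance of the URL parser case: the `URL` constructor takes a USVString, so a lone surrogate in the input is converted to U+FFFD before parsing, whether or not the caller checked the string first. A minimal sketch, assuming a runtime with a standards-compliant global `URL` (browsers, Node.js); the example URL is made up:

```javascript
// A lone high surrogate in the query is replaced with U+FFFD on input,
// then percent-encoded as the UTF-8 bytes of U+FFFD (EF BF BD).
const url = new URL('https://example.com/?q=a\uD800b');

console.log(url.search); // "?q=a%EF%BF%BDb"
```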
@annevk I don't think I follow the point you're trying to make. Assuming this proposal adds an efficient built-in

    function toWellFormed(string) {
      return string.isWellFormed() ? string : string.replaceAll(/\p{Surrogate}/gu, '\uFFFD');
    }

Are you asking for a built-in
I'm saying that the web platform has many code paths that do

Okay, which of these better reflects your position?
Or am I still misunderstanding?
    function isWellFormed(string) {
      return !/\p{Surrogate}/u.test(string);
    }

My point is I don’t understand the distinction you’re making between
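To make the relationship between the two userland helpers discussed in this thread concrete, the sketch below runs both over a few made-up inputs. It uses the standalone `isWellFormed` function from the comment above in place of the proposed built-in method. The `samples` and `results` names are illustrative only.

```javascript
// Userland check from the comment above: true when no lone surrogate is present.
function isWellFormed(string) {
  return !/\p{Surrogate}/u.test(string);
}

// Userland conversion that short-circuits on already well-formed input.
function toWellFormed(string) {
  return isWellFormed(string)
    ? string
    : string.replaceAll(/\p{Surrogate}/gu, '\uFFFD');
}

// Hypothetical sample inputs.
const samples = [
  'abc',           // plain ASCII
  '\uD83D\uDE00',  // valid surrogate pair (U+1F600)
  '\uD800',        // lone high surrogate
  '\uDE00\uD83D',  // pair in the wrong order: two lone surrogates
];

const results = samples.map((s) => ({
  wellFormed: isWellFormed(s),
  fixed: toWellFormed(s),
}));

for (const r of results) {
  console.log(r.wellFormed, JSON.stringify(r.fixed));
}
```

For well-formed input the conversion returns its argument unchanged. For ill-formed input every lone surrogate becomes U+FFFD, and the result passes `isWellFormed`.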
@mathiasbynens In the presence of an engine-optimised

If the stated motivation of this proposal is entirely performance-based, perhaps it shouldn't advance unless there are actual commitments from engines to restructure their string representations to track this hypothetical bit? I'm not aware of any such plans at the moment.

@domenic Agreed. I will make sure to discuss that during the presentation to committee.
Personally I still want this for usability/clarity reasons, fwiw. The performance part is not important to me; the part I care about is that a reader who sees `if (str.isWellFormedUnicode()) ...` is much more likely to get what's going on than a reader who sees `if (!/\p{Surrogate}/u.test(str)) ...`, since the latter requires a lot more background knowledge.

That said, it is my understanding that some engines (at least V8) do already track a distinction between ASCII and other strings (or rather "strings whose UTF-16 code units are all < 256", which is not quite "ASCII"), and therefore can at least have a fast path for this API for ASCII strings, which is already better performance (in that common case) than userland is capable of.
Yeah, I mainly see the benefit of these features in terms of clarity, not performance. And from that perspective you want both. (I don't like the Unicode suffix though. Strings are already Unicode and we don't really expose that term directly anywhere I think.)

FYI I've opened #20 to add
@annevk suggested this here: #11 (comment)
Having both methods be part of the same proposal makes sense to me.
Example `toWellFormed` userland implementation: