This repository has been archived by the owner on Sep 18, 2021. It is now read-only.
There are currently no conformance tests for characters outside the Basic Multilingual Plane (BMP). These characters are significant because they are the only ones whose UTF-16 representation takes more than one code unit (a surrogate pair), and many languages, including Java and JavaScript, use UTF-16 as their native string encoding.
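To make the distinction concrete, here is a minimal sketch in Python that compares the number of code points in a string with the number of UTF-16 code units needed to encode it. The helper names `code_points` and `utf16_code_units` are made up for illustration and are not part of any twitter-text library.

```python
def code_points(text: str) -> int:
    """Number of Unicode code points (what len() reports for a Python str)."""
    return len(text)


def utf16_code_units(text: str) -> int:
    """Number of UTF-16 code units needed to encode the text."""
    # Each UTF-16 code unit is two bytes; "utf-16-le" avoids emitting a BOM.
    return len(text.encode("utf-16-le")) // 2


bmp_char = "\u00e9"         # U+00E9, inside the BMP
astral_char = "\U0001F600"  # U+1F600, outside the BMP

print(code_points(bmp_char), utf16_code_units(bmp_char))        # 1 1
print(code_points(astral_char), utf16_code_units(astral_char))  # 1 2
```

A language whose string length is defined in UTF-16 code units reports the second number, which is where the two counts diverge.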
This problem showed up as a difference between the output of the twitter-text-rb and twitter-text-java libraries. It affects every part of the libraries that depends on counting characters. According to the Twitter documentation on counting characters, a code point outside the BMP should count as only one "character," but some libraries count it as two. This is particularly noticeable when looking at entities extracted from tweets that contain these characters.
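As a rough illustration, and assuming the discrepancy surfaces in the start and end offsets reported for an entity, the sketch below computes the offsets of a hashtag both ways. The function `code_point_index_to_utf16` and the sample tweet are hypothetical; this is not how either library is implemented.

```python
def code_point_index_to_utf16(text: str, index: int) -> int:
    """Convert a code point offset into the equivalent UTF-16 code unit offset."""
    # Encode only the prefix before the offset and count its 2-byte code units.
    return len(text[:index].encode("utf-16-le")) // 2


tweet = "\U0001F600 #test"  # one non-BMP character, a space, then a hashtag

# Offsets counted in code points, as the character-counting docs describe.
start = tweet.index("#test")
end = start + len("#test")

# Offsets a UTF-16 based implementation would report if it counted code units.
utf16_start = code_point_index_to_utf16(tweet, start)
utf16_end = code_point_index_to_utf16(tweet, end)

print((start, end))              # (2, 7)
print((utf16_start, utf16_end))  # (3, 8)
```

Every non-BMP character that precedes an entity shifts the code unit offsets by one more position, so the two libraries would disagree on any such tweet.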
This issue would include a pull request, but I haven't found a way to encode one of these characters in YAML that common YAML libraries recognize. As far as I understand the spec, embedding the character directly as UTF-8 should be allowed, but at least some Java YAML parsers reject that (valid) YAML document. In addition, most YAML libraries do not support YAML's 32-bit Unicode escapes.
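For reference, the candidate encodings can be generated with a short Python sketch. The 8-digit `\U` form is the YAML 32-bit escape for double-quoted scalars; the surrogate-pair `\u` form is the JSON-style spelling, and whether any particular YAML parser accepts either one is an assumption that would need testing. Both function names are hypothetical.

```python
def yaml_escape_32bit(ch: str) -> str:
    """YAML 32-bit escape (backslash, 'U', eight hex digits) for one character."""
    return "\\U{:08X}".format(ord(ch))


def surrogate_pair_escape(ch: str) -> str:
    """JSON-style 16-bit escapes; non-BMP characters become a surrogate pair."""
    cp = ord(ch)
    if cp < 0x10000:
        return "\\u{:04X}".format(cp)
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits select the low surrogate
    return "\\u{:04X}\\u{:04X}".format(high, low)


ch = "\U0001F600"
print(yaml_escape_32bit(ch))      # \U0001F600
print(surrogate_pair_escape(ch))  # \uD83D\uDE00
```

Either string would go inside a double-quoted YAML scalar in a test fixture; the third option, embedding the raw UTF-8 bytes, needs no escaping at all.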