Behavior of characters outside of the Basic Multilingual Plane should be tested #28

j3h · 2012-01-20T01:31:11Z

There are currently no conformance tests for characters outside of the Basic Multilingual Plane. These characters are significant because they are the only characters whose UTF-16 representation is more than one code unit. Many languages use UTF-16 as their native character encoding (including Java and JavaScript).
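To make the code-unit point concrete, here is a minimal sketch in Python, chosen only because its strings are indexed by code point, so both counts can be shown side by side. The helper name `utf16_code_units` is hypothetical and is not part of any twitter-text library.

```python
def utf16_code_units(text: str) -> int:
    """Return how many UTF-16 code units are needed to encode text."""
    # utf-16-le writes no byte-order mark, and every code unit is 2 bytes.
    return len(text.encode("utf-16-le")) // 2


bmp_char = "\u00e9"              # U+00E9, inside the Basic Multilingual Plane
supplementary = "\U0001D11E"     # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

# len() on a Python 3 str counts code points, not UTF-16 code units.
print(len(bmp_char), utf16_code_units(bmp_char))            # 1 1
print(len(supplementary), utf16_code_units(supplementary))  # 1 2
```

A language whose native string length is measured in UTF-16 code units would report 2 for the second string, which is the kind of mismatch this issue is about.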

This problem showed up as a difference between the output of the twitter-text-rb and twitter-text-java libraries, and it affects every part of the libraries that depends on counting characters. According to the Twitter documentation on counting characters, a code point outside the Basic Multilingual Plane should count as only one "character," but some libraries count it as two. The discrepancy is most noticeable in the entities extracted from tweets that contain these characters.
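As a rough illustration of how extracted entities can disagree, the sketch below converts an offset measured in UTF-16 code units into one measured in code points. It assumes that entity positions are reported as offsets into the tweet text. The function name is hypothetical and this is not how any of the twitter-text libraries implement it.

```python
def utf16_offset_to_code_point_offset(text: str, utf16_offset: int) -> int:
    """Convert an offset in UTF-16 code units to an offset in code points.

    An offset that falls between the two halves of a surrogate pair is
    rounded up to the next code point boundary.
    """
    units_seen = 0
    for index, ch in enumerate(text):
        if units_seen >= utf16_offset:
            return index
        # Code points above U+FFFF take two UTF-16 code units (a surrogate pair).
        units_seen += 2 if ord(ch) > 0xFFFF else 1
    return len(text)


tweet = "\U0001D11E #music"
# A UTF-16 based library sees "#" at offset 3; by code point it is at offset 2.
print(utf16_offset_to_code_point_offset(tweet, 3))  # 2
print(tweet.index("#"))                             # 2
```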

This issue would include a pull request, but I haven't been able to find a way to encode one of those characters in YAML that common YAML libraries recognize. As far as I understand the spec, directly embedding the character as UTF-8 should be allowed, but at least some Java YAML parsers won't accept that (valid) YAML document. In addition, YAML's 32-bit Unicode escapes (`\U` followed by eight hex digits) are not supported by most YAML libraries.

keitaf · 2012-01-31T17:58:56Z

I created patches for twitter-text-java and twitter-text-js to count Unicode supplementary characters correctly.

twitter-archive/twitter-text-java#22
twitter-archive/twitter-text-js#39

I also added unit test cases to verify that twitter-text-rb correctly counts Unicode supplementary characters.
twitter-archive/twitter-text-rb#35

Can you please review them?

j3h · 2012-01-31T22:05:08Z

Note that twitter-text-js will also have to account for supplementary code points when it validates the 140-character limit (i.e. in isInvalidTweet).
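A minimal sketch of a length check that counts code points rather than UTF-16 code units might look like this. The names `tweet_length` and `is_too_long` are hypothetical and are not the twitter-text-js API. The NFC normalization step is an assumption based on my reading of the Twitter documentation on counting characters.

```python
import unicodedata

MAX_TWEET_LENGTH = 140  # the limit mentioned in this thread


def tweet_length(text: str) -> int:
    """Count code points after NFC normalization (normalization is an assumption)."""
    return len(unicodedata.normalize("NFC", text))


def is_too_long(text: str) -> bool:
    """Return True when text exceeds the limit, counted in code points."""
    return tweet_length(text) > MAX_TWEET_LENGTH


supplementary = "\U0001F600"  # one code point, two UTF-16 code units

print(is_too_long(supplementary * 140))  # False: exactly 140 code points
print(is_too_long(supplementary * 141))  # True
```

A check based on UTF-16 code units would see 280 units in the first string and reject it, even though it contains only 140 code points.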

I couldn't find length validation in twitter-text-java. Did I miss it, or is it not supported?
