This repository has been archived by the owner on Sep 18, 2021. It is now read-only.

Behavior of characters outside of the Basic Multilingual Plane should be tested #28

Open
j3h opened this issue Jan 20, 2012 · 2 comments

Comments


j3h commented Jan 20, 2012

There are currently no conformance tests for characters outside of the Basic Multilingual Plane. These characters are significant because they are the only characters whose UTF-16 representation is more than one code unit. Many languages use UTF-16 as their native character encoding (including Java and JavaScript).
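To make the distinction concrete, here is a minimal Java sketch of the gap between UTF-16 code units and code points. The class and method names are made up for illustration and are not part of any twitter-text library; U+1F600 is just one arbitrary character outside the BMP.

```java
// Hypothetical demo class, not part of twitter-text.
class CodeUnitDemo {
    // "a", then U+1F600 written as its UTF-16 surrogate pair, then "b".
    static final String SAMPLE = "a\uD83D\uDE00b";

    // Number of UTF-16 code units: what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points: a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(SAMPLE));  // prints 4
        System.out.println(codePoints(SAMPLE)); // prints 3
    }
}
```

Any code that relies on the first number where the second is intended will be off by one for every supplementary character in the text.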

This problem surfaced as a difference between the output of the twitter-text-rb and twitter-text-java libraries, and it affects every part of the libraries that depends on counting characters. According to the Twitter documentation on counting characters, a code point outside the BMP should count as only one "character," but some libraries count it as two. The difference is particularly noticeable in the entities extracted from tweets that contain these characters.
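As a rough illustration of how the two ways of counting diverge for extracted entities, the sketch below converts a position between UTF-16 code unit offsets and code point offsets. It assumes the mismatch shows up in entity start positions; the class and helper names are hypothetical and do not come from any of the twitter-text libraries.

```java
// Hypothetical helper class, not part of twitter-text.
class EntityOffsets {
    // Convert a UTF-16 code unit offset into a code point offset.
    static int toCodePointIndex(String text, int codeUnitIndex) {
        return text.codePointCount(0, codeUnitIndex);
    }

    // Convert a code point offset back into a UTF-16 code unit offset.
    static int toCodeUnitIndex(String text, int codePointIndex) {
        return text.offsetByCodePoints(0, codePointIndex);
    }

    public static void main(String[] args) {
        // U+1F600 (as a surrogate pair), a space, then a hashtag.
        String text = "\uD83D\uDE00 #tag";
        int unitStart = text.indexOf('#');
        System.out.println(unitStart);                         // prints 3
        System.out.println(toCodePointIndex(text, unitStart)); // prints 2
    }
}
```

A library that reports the hashtag at position 3 and one that reports it at position 2 are both internally consistent, which is why a shared conformance test is needed to pin down the expected answer.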

This issue would include a pull request, but I haven't been able to figure out a way to encode one of those characters into YAML in a way that is recognized by common YAML libraries. As far as I understand the spec, directly embedding the character as UTF-8 should be allowed, but at least some Java YAML parsers won't accept that (valid) YAML document. In addition, YAML's 32-bit Unicode escapes (\UXXXXXXXX) are not supported by most YAML libraries.

Contributor

keitaf commented Jan 31, 2012

I created patches for twitter-text-java and twitter-text-js to count Unicode supplementary characters correctly.

twitter-archive/twitter-text-java#22
twitter-archive/twitter-text-js#39

And also added unit test cases to verify that twitter-text-rb correctly counts Unicode supplementary characters.
twitter-archive/twitter-text-rb#35

Can you please review them?

Author

j3h commented Jan 31, 2012

Note that twitter-text-js will also have to account for supplementary code points when it validates the 140-character limit (i.e. in isInvalidTweet).
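For illustration only, a length check that counts code points instead of UTF-16 code units might look like the Java sketch below. The class, constant, and method names are hypothetical; this is not the isInvalidTweet implementation, and it ignores any other rules the real validation applies.

```java
// Hypothetical sketch, not the twitter-text implementation.
class TweetLength {
    static final int MAX_TWEET_LENGTH = 140;

    // Count code points so that a surrogate pair counts as one character.
    static int countCharacters(String text) {
        return text.codePointCount(0, text.length());
    }

    static boolean exceedsLimit(String text) {
        return countCharacters(text) > MAX_TWEET_LENGTH;
    }

    // Test helper: build a string of n copies of one code point.
    static String repeatCodePoint(int codePoint, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.appendCodePoint(codePoint);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = repeatCodePoint(0x1F600, 140);
        System.out.println(text.length());      // prints 280
        System.out.println(exceedsLimit(text)); // prints false
    }
}
```

A check based on the code unit length would reject this 140-character text because its length in code units is 280.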

I couldn't find length validation in twitter-text-java. Did I miss it, or is it not supported?
